PREFACE


The President of the United States approved the 
Space Shuttle program in 1972, to become the 
heart of the National Space Transportation System 
(NSTS) and provide routine, economical access to 
space. The launch of Columbia in 1981 — the first 
reusable vehicle to be launched and orbit the 
earth — opened a new era. The development of the 
Space Shuttle and its operation and maintenance 
have involved several National Aeronautics and 
Space Administration (NASA) centers, their indus- 
trial prime contractors, and scores of subcontrac- 
tors, including tens of thousands of people. This 
must be considered one of the most complex 
technical undertakings of all time. 

After 24 successful Shuttle flights, the Space 
Shuttle Challenger accident of January 28, 1986, 
stunned the entire nation and indeed the world. In 
response to the accident President Reagan estab- 
lished the Presidential Commission on the Space 
Shuttle Challenger Accident (frequently called the 
Rogers Commission, after its chairman) to inves- 
tigate the accident and make recommendations for 
the safe recovery of the Space Transportation 
System (STS). Among its recommendations, the 
Rogers Commission called upon NASA to review 
certain aspects of its STS risk assessment effort and 
to “identify those items that must be improved 
prior to flight to ensure mission success and flight 
safety.”* It further recommended that an audit
panel be appointed by the National Research Coun- 
cil (NRC) to verify the adequacy of the effort and 
report directly to the Administrator of NASA. The 
Committee on Shuttle Criticality Review and Haz- 
ard Analysis Audit was established in response to 
the recommendation. Beginning with the Commit- 
tee’s first meeting on September 22, 1986, this 
report is the culmination of 14 months of investi- 
gation, study, and deliberation. 

While the Committee recognizes that it is not 
possible, a priori, to guarantee mission success and 
flight safety, we hope the Committee’s conclusions 
and recommendations will assist NASA in taking 
those prudent additional steps which will provide 
a reasonable and responsible level of flight safety 
for the Space Shuttle. As the Challenger accident 
made painfully obvious, no probe into space is
routine, and the Space Shuttle is still a developmental
vehicle. The risks of space flight must be
accepted by those who are asked to participate in
each flight as well as by those who are responsible
to the nation for achieving its goals in space. Such
risks should also be recognized by Executive Branch
officials and Congress in their review and oversight
of NASA endeavors.

* Report to the President by the Presidential Commission on the Space
Shuttle Challenger Accident, William P. Rogers, Chairman (June
1986).

The Committee has been favorably impressed by 
the dedicated effort and beneficial results obtained 
thus far by NASA and its contractors from the STS 
risk assessment and risk management system. The
Committee is also gratified by the progress NASA 
is making in strengthening this system. We appre- 
ciate the close collaboration the Committee had 
with NASA and contractor personnel, the interest 
they showed, and their responsiveness to the Com- 
mittee’s suggestions. Nevertheless, although our 
general impressions are favorable, we do have 
suggestions for improvement. It is against this 
background that the recommendations in this re- 
port should be judged. 

The Committee recognizes that the NSTS risk 
assessment and risk management activities, both 
existing and with the modifications proposed here, 
are large and complex. This means that change 
should be introduced with care. A systematic ex- 
amination of the entire set of processes supporting 
risk assessment and management in order to op- 
timize the total ensemble may be appropriate. Such 
an examination may be particularly useful in con- 
junction with implementation of a new program 
such as the Space Station. 

Although this report and its recommendations 
are directed to the NSTS Program, they are of 
broader applicability. It certainly would be wise to 
consider the lessons learned when structuring any 
risk assessment and management system for other 
programs having attributes similar to the NSTS 
Program, such as the Space Station Program. It, 
too, is a large program involving highly complex 
technology which requires the major participation 
of several NASA centers and prime contractors for 
its execution. 

Acknowledgments

In conducting its work, the full Committee met 
an average of once a month for over a year, and 
individual members and groups of members conducted additional
site visits, research, and writing on behalf
of the Committee. This intense dedication and the 
resulting contributions of the highly competent 
members of the Committee are acknowledged with 
great appreciation. I also would like to express the 
Committee’s appreciation for the excellent support 
of the National Research Council staff in all aspects 
of its work. While this report represents the con- 
tributions by and deliberations of all members of 
the Committee, I would especially like to note the 
contributions to its writing by David S. Johnson 
and Courtland S. Lewis. Mr. Johnson, in particular, 
was extraordinarily effective as Study Director. His 
organizational skills, technical knowledge, and hard 
work were central to our effectiveness as a committee.
The peer review by the National Research
Council also made a key contribution to the quality
of our reports.

In closing, we wish to thank the many NASA 
and contractor employees who facilitated the work 
of the Committee, often extending their already 
heavy workloads in the aftermath of the Challenger 
accident. Of special note was the assistance pro- 
vided during the study by the two NASA liaison 
persons, E. William Land, Jr. and Charles S. Harlan. 

Alton D. Slay 
Chairman, 

Committee on Shuttle Criticality

Executive Summary


The Shuttle Criticality Review and Hazard Analysis 
Audit Committee (SCRHAAC) was formed by the 
National Research Council (NRC), at the request 
of the National Aeronautics and Space Adminis- 
tration (NASA), in response to a recommendation 
of the Presidential Commission on the Space Shuttle 
Challenger Accident (also known as the Rogers 
Commission). That Commission had recommended 
that NASA review and evaluate certain aspects of 
its process for ensuring the safety of the National 
Space Transportation System (NSTS), and that an 
NRC panel be appointed to audit the NASA review 
effort and verify its adequacy. 

The Committee monitored the overall NASA 
review and evaluation effort while performing 
detailed on-site reviews of its implementation for 
selected elements and subsystems 1 (e.g., the Space 
Shuttle Main Engine, Solid Rocket Booster, Aux- 
iliary Power Unit). As areas of particular concern 
emerged, such as software issues, the adequacy of 
Orbiter structural margins, integrated Space Trans- 
portation System (STS) analysis in support of risk 
assessment, and Orbiter steering on landing, the 
Committee pursued those concerns in greater detail. 
Various operational issues affecting Shuttle safety 
(e.g., the application of Launch Commit Criteria 
and the “cannibalization” of spare parts) were also 
examined. Each of these audits was conducted 
through a series of meetings with NASA and 
contractor personnel on-site at the contractor facilities
and NASA centers, and by reviewing available
documentation. In addition, two NASA liaison
persons provided direct input on questions raised
by the Committee on an ongoing basis and provided
substantial reports on certain points of concern.

1 There are four major flight “elements” in the Space Shuttle (Orbiter,
Space Shuttle Main Engines, Solid Rocket Boosters, and External
Tank), each of which is composed of several subsystems.

The Committee appreciates that NASA has ac- 
complished the design, development, verification, 
and certification of the STS utilizing a management 
approach and procedures that have been, in large 
part, most successful. The Committee also recog- 
nizes that the risk assessment and management 
recommendations made in this report will only be 
useful if they are introduced in rational, practical 
stages. The Committee believes, however, that the 
safety of continuing operations of the STS can be 
improved by creating an integrated risk assessment 
and management program which builds on the 
largely qualitative methods used previously. The 
totality of the recommendations, once such a system 
is implemented, should be extremely valuable in 
the accomplishment of the NSTS Program in the 
future, and should serve as a prototype for similar 
programs in NASA as well. 

During the course of its work, the Committee 
produced two interim progress reports to the Ad- 
ministrator of NASA in which more than a dozen 
recommendations and suggestions were made. Some 
of the concerns expressed in the interim reports 
have been resolved since the reports were presented; 
others remain at issue. All of the concerns identified 
in those reports are reflected in the Findings and 
Recommendations summarized in Section 1.3. 

1.1 NASA’S SAFETY POLICY AND PROCESS

NASA policy regarding safety is established by 
the Administrator; its essence (as stated in NASA 
Policy Directive 1701.1) is to: 

“a. Avoid loss of life, injury of personnel, damage and 
property loss.

“b. Instill a safety awareness in all NASA employees and
contractors. 

“c. Assure that an organized and systematic approach is
utilized to identify safety hazards and that safety is 
fully considered from conception to completion of all 
agency activities. 

“d. Review and evaluate plans, systems, and activities 
related to establishing and meeting safety requirements 
both by contractors and by NASA installations to 
ensure that desired objectives are effectively achieved.” 

Every manager throughout the organization is responsible
for systematically identifying risks, hazards,
or unsafe situations or practices, and for
taking steps to assure adequate safety in the activ- 
ities and products under his supervision. Out of 
this broad policy framework are derived the more 
specific safety requirements that are implemented 
in successively greater detail down through Head- 
quarters, program, and project organizations at the 
NASA centers and contractors. The Committee 
finds that the basic documents setting forth these 
policies are complete and do establish a firm 
foundation for the NASA-wide safety program. 

Central to NASA’s analyses to ensure reliability 
of the Shuttle system is the Failure Modes and 
Effects Analysis (FMEA). FMEAs are performed
on all STS flight hardware as well as Ground 
Support Equipment (GSE) which interfaces with 
flight hardware at the launch sites to identify 
hardware items that are critical to the performance 
and safety of the vehicle and the mission, and to 
identify items that do not meet design requirements. 
Each possible failure mode is identified and then 
analyzed to determine the resulting performance 
of the system and to ascertain the worst-case effect 
that could result from a failure in that mode. All 
the identified “critical items” are then categorized 
according to the worst-case effect of the failure on 
the crew, the vehicle, and the mission. If the worst- 
case effect is loss of life or vehicle, the item is 
categorized as Criticality 1 (1R if there are redundant
units, and 1S if it would result from the failure
of a piece of ground support equipment). In the
same manner, Criticality 2 and 2R are cases where
loss of mission could result. 
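
The categorization rule described above can be sketched in a few lines of Python. This is only an illustration, not NASA's actual procedure: the names `WorstCase`, `FailureMode`, `redundant`, and `ground_support` are hypothetical, the precedence of the 1S label over 1R is an assumption, and the label returned for items outside Criticality 1 and 2 is a placeholder, since the text does not define lower categories.

```python
from dataclasses import dataclass
from enum import Enum


class WorstCase(Enum):
    """Worst-case effect of a failure mode (hypothetical encoding)."""
    LOSS_OF_LIFE_OR_VEHICLE = "loss of life or vehicle"
    LOSS_OF_MISSION = "loss of mission"
    OTHER = "other"


@dataclass(frozen=True)
class FailureMode:
    item: str
    worst_case: WorstCase
    redundant: bool = False        # redundant units exist for this item
    ground_support: bool = False   # failure originates in ground support equipment


def criticality(fm: FailureMode) -> str:
    """Return the criticality label for one failure mode."""
    if fm.worst_case is WorstCase.LOSS_OF_LIFE_OR_VEHICLE:
        if fm.ground_support:
            return "1S"  # assumption: the GSE label takes precedence over 1R
        return "1R" if fm.redundant else "1"
    if fm.worst_case is WorstCase.LOSS_OF_MISSION:
        return "2R" if fm.redundant else "2"
    return "other"  # placeholder: lower categories are not defined in the text
```

For example, `criticality(FailureMode("hypothetical valve", WorstCase.LOSS_OF_LIFE_OR_VEHICLE, redundant=True))` returns `"1R"`.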

The result of this classification is a “Critical 
Items List” (CIL) which includes for each item the 
rationale for its retention on the STS, thus requiring 
a waiver of the NASA policy against flying with 
such items present. The retention rationale is the 
primary input to NASA waiver decisions to fly the 
Shuttle, exposing the STS and its crew to the risk
implicit in the use of the analyzed critical item.
The retention rationale is used to justify accepting 
the design “as is,” in the Committee’s view; its 
audits of the NASA review process discovered little 
emphasis on creative ways to eliminate potential 
failure modes. 

The hazard analysis is another analytical tool 
used to identify and, if possible, resolve hazardous 
conditions that could develop while operating and 
maintaining STS hardware and software. Hazard 
analyses consider not only the failures identified in 
the FMEA process, but also other potential threats
posed by the environment, crew-machine inter- 
faces, and mission activities. Identified hazards and 
their causes are analyzed to find ways to eliminate 
or control the hazard. A hazard is said to be 
“eliminated” when its source has been removed. 
A “controlled hazard” is one that has effectively 
been controlled by a design change, addition of 
safety or warning devices, procedural changes, or 
operational constraints. Any hazard that cannot 
feasibly be eliminated or controlled is termed an 
“accepted risk.” 
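
The three hazard dispositions defined above can be expressed as a small decision function. The sketch below is a simplification under stated assumptions: the names `Disposition`, `source_removed`, and `effective_controls` are hypothetical, and the function takes the list of controls at face value, whereas judging whether a control is actually effective is an engineering decision that no such function can make.

```python
from enum import Enum
from typing import Sequence


class Disposition(Enum):
    ELIMINATED = "eliminated"
    CONTROLLED = "controlled hazard"
    ACCEPTED_RISK = "accepted risk"


def disposition(source_removed: bool,
                effective_controls: Sequence[str]) -> Disposition:
    """Classify a hazard using the three terms defined in the text.

    effective_controls is a hypothetical list of measures judged effective,
    e.g. a design change, a safety or warning device, a procedural change,
    or an operational constraint.
    """
    if source_removed:
        return Disposition.ELIMINATED
    if effective_controls:
        return Disposition.CONTROLLED
    # Neither eliminated nor controlled: the hazard remains as an accepted risk.
    return Disposition.ACCEPTED_RISK
```

A hazard whose source remains and for which no effective control is listed, `disposition(False, [])`, is classified as `Disposition.ACCEPTED_RISK`.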

There are many other analysis and assessment 
tools used by NASA. This complex mosaic of 
analysis techniques is intended to provide an all- 
encompassing approach to ensuring the design 
reliability and safety of the STS. Some of the 
techniques, such as the hazard analyses, tend to be 
“top-down” approaches that examine certain cross-systems
causes and effects. Others, such as FMEA/CIL,
are narrower “bottom-up” analyses that pursue
a specific event to its conclusion, but only
with respect to the subsystem involved. 

In March 1986, soon after the Challenger acci- 
dent, direction was issued within NASA to reeval- 
uate the FMEAs on all critical items on the STS, 
“. . . to affirm the completeness and accuracy of 
the FMEA/CIL for the current National STS de- 
sign.” Following reevaluation of the FMEA, each 
Criticality 1 and 1R item, along with any new 
items, or items for which the reevaluation had led 
to a change in classification, was to be resubmitted 
for review and approval of the waiver permitting 
the item to be flown aboard the STS. Those items 
not revalidated by the review were required to be 
redesigned, certified, and qualified for flight. In 
addition to the FMEA/CIL reevaluation, the direc- 
tives stipulated that the hazard analyses and a set 
of special Element Interface Functional Analyses 
(EIFAs) were also to be reviewed for completeness 
and accuracy.

Since the Challenger mission 51-L accident, a
substantial number of engineering changes have 
been undertaken to improve Shuttle safety prior to 
resumption of flight. The redesign activity has, for 
the most part, preceded the FMEA/CIL and hazard 
analysis reevaluations. However, as the reevalua- 
tions proceeded, they disclosed a number of addi- 
tional items which are being addressed before the 
next flight. 

1.2 THE COMMITTEE’S VIEW 

As the Challenger accident made very evident, 
space flight is not routine. Its risks must be accepted 
by those who are asked to participate in each flight 
as well as by those who are responsible to the 
nation for achieving our goals in space. The Com- 
mittee believes that the basis for NASA’s acceptance 
of those risks should, as far as possible, stem from 
rationally derived criteria. This acceptance also 
should depend very heavily on the quality of the 
methodology and the degree of objectivity by which 
the risks are determined, as well as the rigor by 
which the risks are controlled (i.e., managed). 

Very early in the work of the Committee, it 
became clear that NASA’s processes for analyzing 
failure modes, effects, and hazards could only be 
understood and evaluated intelligently when viewed 
as elements of an overall program of risk assessment 
and risk management. In the Committee’s view, 
any such program should include the following 
basic elements: 

Risk assessment: 

— A comprehensive method for identifying po- 
tential failure modes and hazards associated with 
the system. 

— A specific, quantitative methodology for iden- 
tifying and assessing (or estimating) the safety risks 
of the system. 

Risk management: 

— A management process by which the safety 
risks can be brought to levels or values that are 
acceptable to the final approval authority. Risk 
management includes establishment of acceptable 
risk levels; the institution of changes in system 
design or operational methods to achieve such risk 
levels; system validation and certification; and 
system quality assurance. The basic organizational
elements are in place within NASA for assessing
and managing risk; however, there is a need for a 
change in the scope of functions and the way that 
they are carried out. 

The Committee believes that the management of 
the risks of the STS must be the responsibility of 
line management (i.e., the NSTS Program Manager, 
the Associate Administrator for Space Flight and, 
ultimately, the Administrator of NASA). Only this 
program management, not the safety organizations,
can make judicious use of the means available to 
achieve operational goals while controlling the 
safety risks at acceptable levels throughout the 
evolution of the program. The safety organizations 
at NASA centers and Headquarters are staff or- 
ganizations — as such, they can and should be 
responsible for providing assessments of the sys- 
tem’s risks. They should also be responsible for 
assuring that the activities associated with con- 
trolling the risks to the specified levels have been 
carried out and documented. Safety organizations 
cannot, however, assure safe operation. 

Certain shortcomings in process and methodol- 
ogy exist which are discussed in Section 5 and 
summarized in Section 1.3 below. In particular, 
there is a fundamental problem in the nature of 
and the methods used to develop the overall as- 
sessments on which NASA line management bases 
its decisions about how to reduce and control risk 
in the STS. 

Risks in STS operations now are assessed based 
on subjective judgments and accepted on the basis 
of qualitative rationales, although many quantita- 
tive engineering analyses and test data relevant to 
risk assessment are available and often are used in
arriving at what are finally qualitative, subjective 
judgements. With such a non-specific (i.e., non- 
value based) risk acceptance process there is little 
basis for making objective comparisons of the 
several major risk categories associated with the 
STS, nor for carrying out risk evaluations by 
independent agencies. Neither can one systemati- 
cally track the efforts to reduce the risk or impact 
of the various possible failures. Without more 
objective, quantifiable measures of relative risk it 
is not clear how NASA can expect to implement a 
truly effective risk management program. However, 
the Committee does not wish to suggest that NASA 
subordinate sound technical judgement to numer- 
ical analysis. Such an approach would be, in our 
opinion, unrewarding and counterproductive. 


1.3 FINDINGS AND RECOMMENDATIONS 

Following are the major findings of the Committee
and the specific recommendations associated
with them. The summary findings and recommen- 
dations are extracted from Section 5 of the report, 
which includes a discussion of each one. The 
subsection numbering here parallels that in Section 
5. For example, Subsection 1.3.1 corresponds to 
Subsection 5.1, 1.3.2 corresponds to 5.2, and
1.3.9.1 corresponds to 5.9.1. In addition, the recommendations
are numbered sequentially and identically
in both sections. It should be noted that the
recommendations are not listed in any priority 
order. 

1.3.1 Critical Items List Retention Rationale Review 
and Waiver Process 

The Committee views the NASA critical items 
list (CIL) waiver decision making process as being 
subjective, with little in the way of formal and 
consistent criteria for approval or rejection of 
waivers. Waiver decisions appear to be driven 
almost exclusively by the design-based FMEA/CIL 
retention rationale, rather than being based on an 
integrated assessment of all inputs to risk manage- 
ment. The retention rationales appear biased to- 
ward proving that the design is “safe,” sometimes 
ignoring significant evidence to the contrary (see 
Section 5.1). 

Although the Safety, Reliability, and Quality 
Assurance (SR&QA) 2 organizations of NASA col- 
lect, verify, and transmit all data related to FMEA/ 
CIL and hazard analysis results, the Committee 
has not found an independent, detailed analysis or 
assessment of the CIL retention rationale which 
considers all inputs to the risk assessment process. 

Recommendations (1): 

The Committee recommends that NASA estab- 
lish an integrated review process which provides a 
comprehensive risk assessment and an independent 
evaluation of the rationale justifying the retention 
of Criticality 1 and 1R items. This integrated review
should include detailed consideration of the results
of hazard analyses and all other inputs to the risk
assessment process, in addition to the FMEA/CIL
retention rationale. Further, the review process
should assure that the waivers and supporting
analyses fully reflect current data and designs.
Finally, NASA should develop formal, objective
criteria for approving or rejecting proposed critical
item waivers.

2 As of September 1987, the NASA Headquarters organization is
called Safety, Reliability, Maintainability, and Quality Assurance
(SRM&QA), while the similar organizations at the NASA centers are
still named SR&QA. In this report, SR&QA also is used to refer
generically to this function.

1.3.2 Critical Items List Prioritization and Disposition 

At present, in NASA instructions all Criticality 
1 and 1R items are formally treated equally, even 
though many differ substantially from each other 
in terms of the probability of failure or malper- 
formance, and in terms of the potential for the 
worst-case effects postulated in the FMEA to be 
seen if the particular failure occurs. 

The large number of Criticality 1 and 1R items 
at the time of the 51-L accident has since been
substantially increased due to changes in ground 
rules for classification and the complete reevalua- 
tion of the entire STS. 

The Committee believes that giving equal man- 
agement attention to all Criticality 1 and 1R 
potential failures could be detrimental to safety if, 
as is the case, some are extremely unlikely to occur, 
or if the probability is very low that the postulated 
worst-case consequences of the failures will result. 
Treating all such items equally will necessarily 
detract from the attention senior management can 
give to the most likely and most threatening failure 
modes. 

Recommendations (2): 

The Committee recommends that the formal 
criteria for approving waivers include the proba- 
bility of occurrence and probability that the worst- 
case failures will result. We further recommend 
that NASA establish priorities now among Criti- 
cality 1 and 1R items, taking care not to use 
ambiguous measures of risk and probability. NASA 
should also modify the definitions of criticality in 
terms of the probability of failure and probability 
of worst-case effects. Finally, we recommend that 
NASA Level I management pay special attention 
to those items identified as being of highest priority, 
along with the rationale that produced the priority 
rating. Responsibility for attending to lower-prior- 
ity items within the present Criticality 1 and 1R 
categories, when reclassified, should be distributed 
to Levels II and III for detailed evaluation and 
decision. 
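
One way to read this recommendation is as a ranking of critical items by the product of two probabilities: the probability that the failure occurs, and the probability that the worst-case effect follows given the failure. The sketch below illustrates that reading only. The item names, probability values, and Level I threshold are all hypothetical, and the product is merely one candidate measure; the text does not prescribe a formula.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CriticalItem:
    name: str
    p_failure: float                   # probability the failure mode occurs
    p_worst_case_given_failure: float  # probability the worst-case effect follows

    @property
    def risk(self) -> float:
        # Assumed measure: joint probability of failure and worst-case effect.
        return self.p_failure * self.p_worst_case_given_failure


def prioritize(items: Sequence[CriticalItem],
               level_one_threshold: float) -> List[Tuple[str, str]]:
    """Rank items by descending risk and assign a review level to each."""
    ranked = sorted(items, key=lambda item: item.risk, reverse=True)
    return [
        (item.name,
         "Level I" if item.risk >= level_one_threshold else "Levels II/III")
        for item in ranked
    ]


# Hypothetical inputs, for illustration only.
items = [
    CriticalItem("item B", p_failure=1e-5, p_worst_case_given_failure=1.0),
    CriticalItem("item A", p_failure=1e-3, p_worst_case_given_failure=0.5),
    CriticalItem("item C", p_failure=1e-2, p_worst_case_given_failure=0.01),
]
ranking = prioritize(items, level_one_threshold=2e-4)
```

With these made-up numbers, item A ranks first and is routed to Level I, while items C and B fall below the threshold. Keeping the two probabilities as separate, named fields is deliberate: it avoids collapsing them into a single ambiguous "risk" figure before the ranking rule is applied.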

1.3.3 Hazard Analysis and Mission Safety Assessment

NASA hazard analyses currently do not address 
the relative probabilities of a particular hazardous 
condition arising from failure modes, human errors, 
or external situations. 

The hazard analysis and the mission safety as- 
sessment do not: address the relative probabilities 
of the various consequences which may result from 
hazardous conditions; provide an independent eval- 
uation of the retention rationales stated in the input 
CILs; or provide an overall risk assessment on 
which to base the acceptance and control of residual 
hazards. 

Recommendations (3): 

The Committee recommends that the FMEA/ 
CILs be used as one of many inputs considered in 
the hazard analysis and system safety assessment. 
We also recommend that the overall system safety 
assessment encompass a quantitative risk assess- 
ment which in turn uses the CILs and hazard 
analyses as input. Finally, the Committee recom- 
mends that this risk assessment be the primary 
basis for retention or rejection of residual hazards 
as well as critical items. 

1.3.4 Relationship of Formal Risk Assessment Process 
to Space Transportation System Engineering Changes 

Elements of formal risk assessment, such as 
FMEA/CILs and hazard analyses (HAs), appear to 
have had little direct impact on the STS recovery 
engineering process, as they have not figured prom- 
inently in the majority of engineering change de- 
cisions made by NASA management. 

Recommendation (4): 

The Committee recommends that NASA take 
firm steps to ensure a continuing and iterative 
linkage between the formal risk assessment process 
(e.g., FMEA/CIL and HA) and the STS engineering 
change activities. 

1.3.5 Timely Feedback of Data Into the Risk 
Assessment and Management Processes 

The Committee has found many indications that 
data from STS inspection, test and repair, and 
inflight operations do not always feed back rapidly 
enough or effectively enough into the risk assess- 
ment and management processes. 


Recommendations (5):

The Committee recommends that high-level NASA 
management attention and priority be given to 
increasing the efficiency of the flow, analysis, and 
use of inspection, test and repair, test results, and 
in-flight operations data throughout the decision- 
making process. The Committee also recommends 
that full implementation of the System Integrity 
Assurance Program (SIAP), including its Program 
Compliance Assurance Status System (PCASS), be 
given a high priority. Diverse professionals (e.g., 
design and development engineers, operating per- 
sonnel, statistical analysts) should be used in the 
development of this program, with maximum pos- 
sible early involvement by potential users and key 
decision makers. The Committee further recom- 
mends that procedures be implemented to ensure 
that all mission anomalies detected in real time and 
from recorded events, and those detected during 
the near-term inspection of recovered hardware, 
also are fed into the formal risk assessment and 
management processes for action prior to commit- 
ting to the next flight. Finally, the Committee 
recommends that all such anomalies be called to
the immediate attention of launch decision makers 
who will justify in writing their decisions regarding 
the disposition of the anomalies. 

1.3.6 The Need for Quantitative Measures of Risk 

Quantitative assessment methods, such as prob- 
abilistic risk assessment, have not been used directly 
to support NASA decision making regarding the 
STS, although quantitative analyses and test data 
often are used in arriving at qualitative, subjective 
judgments upon which decisions are based. Pow- 
erful methods of statistical inference are now avail- 
able which allow the integration of all sources of 
information on risk, including data on partial 
degradations and failures as well as engineering 
models of failure modes. 

NASA is not adequately staffed with specialists 
and engineers trained in the statistical sciences to 
aid in the transformation of complex data into 
information useful to decision makers, and for use 
in setting standards and goals. 

Recommendations (6): 

The Committee recommends that probabilistic 
risk assessment approaches be applied to the Shuttle 
risk management program at the earliest possible 
date. Data bases derived from STS failures, anomalies,
and flight and test results, and the associated
analysis techniques, should be systematically ex- 
panded to support probabilistic risk assessment, 
trend analyses, and other quantitative analyses 
relating to reliability and safety. Although the 
Committee believes that probabilistic risk assess- 
ment approaches will greatly improve NASA’s risk 
assessment process, it recognizes that these ap- 
proaches should not substitute for good engineering 
and quality control practices in design, develop- 
ment, test, manufacturing, and operations, all of 
which must continue to receive high priority em- 
phasis by NASA and its contractors. The Com- 
mittee further recommends that NASA build up its 
capability in the statistical sciences to provide 
improved analytical inputs to decision making. 
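
As a minimal illustration of how statistical inference can combine an engineering estimate with test or flight data, the sketch below applies a Beta-binomial update to a failure probability. It is a textbook technique chosen for brevity, not a method the Committee specifies. The function names and all numbers are hypothetical, and the model treats each trial as a simple pass or fail, so it does not capture the partial degradations mentioned in Section 1.3.6.

```python
from typing import Tuple


def beta_update(prior_alpha: float, prior_beta: float,
                failures: int, successes: int) -> Tuple[float, float]:
    """Update a Beta(alpha, beta) prior on failure probability with new trials."""
    if failures < 0 or successes < 0:
        raise ValueError("trial counts must be non-negative")
    # Conjugate update: failures add to alpha, successes add to beta.
    return prior_alpha + failures, prior_beta + successes


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: an engineering model suggests a failure probability
# near 0.01, encoded here as Beta(1, 99); 50 trials then show no failures.
alpha, beta = beta_update(1.0, 99.0, failures=0, successes=50)
posterior_mean = beta_mean(alpha, beta)  # 1 / 150
```

The prior carries the engineering judgment and the update carries the data, so neither replaces the other. A fuller probabilistic risk assessment would go well beyond this single-parameter case.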

1.3.7 The Need for Integrated Space Transportation
System Engineering Analysis in Support of Risk
Management

NASA safety-related analyses tend to focus pri- 
marily on single-event, worst-case failures to the 
relative exclusion of possible multiple and syner- 
gistic failures in different subsystems or elements 
of the STS. In addition, the connection between 
the various analyses appears tenuous. There does 
not appear to be an adequate integrated-system 
view of the entire STS. 

Recommendation (7): 

A “top-down” integrated system engineering 
analysis, including a system safety analysis, that 
views the sum of the STS elements as a single 
system should be performed to help identify any 
gaps that may exist among the various “bottom- 
up” analyses centered at the subsystem and element 
levels. 

1.3.8 Independence of the Space Transportation
System Certification and Software Validation and
Verification Program

In general, hardware certification and verifica- 
tion, and software validation and verification 3 in 
STS are managed and conducted primarily by the 
same organizational elements responsible for the 
design and fabrication of the units. Thus, the
independence of the certification, validation, and
verification processes is questionable. For example:

— The contractor that builds the Orbiters (Rockwell
International, STS Division) is also responsible
for preparing the documentation and performing
the work involved in certification, but does not
answer to an entity independent of the NSTS
Program with regard to the certification function.

— At Marshall Space Flight Center (MSFC), the
Engineering Directorate has the prime responsibility
for design requirements for the propulsion
elements of STS and also has responsibility for the
review and approval of their certification. The
Program Office is responsible for the design and
development phase as well as for performing the
certification activities.

— At the Johnson Space Center (JSC), prime
responsibility for design requirements, design and
development, and certification for the Orbiter all
rest with the Program Office, supported by the
Engineering and Operations Directorates of the
Center.

— “Independent” validation and verification
(IV&V) of software is carried out by the same
contractor (IBM) that produces the STS software,
with some checks being made by JSC.

3 See Appendix A for definition of these terms.

Recommendation (8): 

Responsibility for approval of hardware certifi- 
cation and software IV&V should be vested in 
entities separate from the NSTS Program structure 
and the centers directly involved in STS develop- 
ment and operation. However, these organizations 
should continue to conduct activities supporting 
certification and IV&V. 

1.3.9 Operational Issues 

1.3.9.1 Launch Commit Criteria Waiver Policy

An average of two Launch Commit Criteria 
(LCCs) are waived by NASA in the course of each 
launch. The Committee questions the validity of 
an operational procedure that “institutionalizes” 
waivers by routinely permitting established criteria 
to be violated. 

Recommendation (9a): 

The Committee recommends that NASA establish
a list of mandatory LCCs which may NOT be
waived by anyone. This should comprise the bulk
of the LCCs. A limited number of criteria would 
be separately listed, for special cases, together with 
a discussion of the circumstances under which they 
may be waived and who may make the waiver 
decision. 
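
The two-tier structure proposed in Recommendation (9a) can be sketched as a small data model. The Python below is illustrative only: the class names, the criterion identifiers, and the "Launch Director" waiver authority are hypothetical and are not taken from NASA documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LaunchCommitCriterion:
    """One LCC entry. All identifiers used below are hypothetical."""
    ident: str
    description: str
    waivable: bool = False                   # mandatory unless separately listed
    waiver_authority: Optional[str] = None   # who may decide, if waivable
    waiver_conditions: Optional[str] = None  # circumstances permitting a waiver


class WaiverDenied(Exception):
    """Raised when a waiver request violates the policy."""


def request_waiver(lcc: LaunchCommitCriterion, requested_by: str) -> str:
    """Grant a waiver only for a separately listed criterion, and only to
    the named authority; mandatory criteria can be waived by no one."""
    if not lcc.waivable:
        raise WaiverDenied(f"{lcc.ident} is mandatory and may not be waived")
    if requested_by != lcc.waiver_authority:
        raise WaiverDenied(f"{requested_by} may not waive {lcc.ident}")
    return f"{lcc.ident} waived by {requested_by}: {lcc.waiver_conditions}"


# Hypothetical entries, for illustration only.
MANDATORY = LaunchCommitCriterion("LCC-EX-001", "example mandatory limit")
SPECIAL = LaunchCommitCriterion(
    "LCC-EX-002",
    "example special-case limit",
    waivable=True,
    waiver_authority="Launch Director",
    waiver_conditions="redundant measurement confirmed within limits",
)
```

In this sketch a criterion is mandatory by default, so a waiver is possible only when the entry explicitly records both the permitting circumstances and the deciding authority, mirroring the separate special-case list the Committee describes.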

1.3.9.2 Human Factors as a Contributor to Risk

Human factors, which are considered in some 
of the STS hazard analyses, do not appear to be 
taken into account as the cause of failure modes 
in the FMEAs. Since the FMEA is one of the
principal safety tools used in the evaluation of the 
STS design, the Committee believes that the STS 
design process should explicitly consider and min- 
imize the potential contribution of humans to the 
initiation of the defined failure modes. 

Recommendation (9b):

The Committee recommends that the NASA 
FMEA include human factors among the recog- 
nized sources of potential causes of failure modes. 
This step would provide another valid link between 
the FMEA and the hazard analysis, which are now, 
in our view, too tenuously connected. 

1.3.9.3 Cannibalization of Spare Parts

By the time of the Challenger accident, “cannibalization,”
the removal of parts at the Kennedy
Space Center (KSC) from one operational STS 
element to fulfill spares requirements in another, 
had become a prevalent feature of STS logistics, 
thus introducing a variety of failure potentials 
associated with human error. Cannibalization is 
not evaluated as a producer of potential failure in 
either the hazard analysis (where it would be most 
appropriate) or the FMEA. 

Recommendations (9c): 

The Committee recommends that NASA main- 
tain its current intense attention toward reducing 
cannibalization of parts to an acceptable level. We 
further recommend that adequate funds for the 
procurement and repair of spare parts be made 
available by NASA to ensure that cannibalization 
is a rare requirement. Finally, we recommend that 
NASA include cannibalization, with its attendant 
removal and replacement operations, as a potential 
producer of failure in the integrated risk assessment 
recommended earlier (Section 1.3.1). 


1.3.10 Other Weaknesses in Risk Assessment and 
Management 

1.3.10.1 The Apparent Reliance on Boards and
Panels for Decision Making 

The multilayered system of boards and panels 
in every aspect of the STS may lead individuals to 
defer to the anonymity of the process and not focus 
closely enough on their individual responsibilities 
in the decision chain. The sheer number of STS- 
related boards and panels seems to produce a 
mindset of “collective responsibility.”

Recommendation (10a):

The Committee recommends that the Adminis- 
trator of NASA periodically remind all NASA 
personnel that boards and panels are advisory in 
nature. He should specify the individuals in NASA, 
by name and position, who are responsible for 
making final decisions while considering the advice 
of each panel and board. NASA management 
should also see to it that each individual involved
in the NSTS Program is completely aware of his/ 
her responsibilities and authority for decision mak- 
ing. 

1.3.10.2 Adequacy of Orbiter Structural Safety 
Margins 

The primary structure of the STS has been 
excluded, by definition, from the FMEA/CIL proc- 
ess, based on the belief that there is an adequate 
positive margin of safety. However, the Committee 
questions whether operating structural safety mar- 
gins have actually been proven adequate. 

Completion of the Model 6.0 loads study and 
the reevaluation of margins of safety based on 
these loads will significantly improve NASA’s grasp 
of actual operating margins of safety. 

Recommendations (10b):

The Committee recommends that NASA place a 
high priority on completion of the Model 6.0 loads, 
the reevaluation of safety margins for these loads, 
and the early verification and continued monitoring 
of the Model 6.0 loads by permanently instrumenting
and calibrating at least the next full scale
STS vehicle to fly. We further recommend that 
NASA complete and implement a comprehensive 
plan for conducting periodic inspection and main- 
tenance of the structure of the Orbiters throughout 
the service life of each vehicle. 
1.3.10.3 Software Issues

NASA FMEAs do not assess software as a 
possible cause of failure modes. 

There is little involvement of JSC Safety, Relia- 
bility, and Quality Assurance in software reviews, 
resulting in little independent quality assurance for 
software. 

A large amount of data — much of it flight spe- 
cific — must be loaded for each Shuttle mission but 
it is not subjected to validation as rigorous as that 
for the software. 

Recommendations (10c): 

The Committee recommends that NASA: explore 
the feasibility of performing FMEAs on software, 
including the efficacy of identifying and predicting 
fault and error modes; request JSC SR&QA to 
provide periodic review and oversight of software 
from a quality assurance point of view; provide 
for validation of input data in a manner similar to 
software validation and verification. 

1.3.10.4 Differences in Procedures Among NASA 
Centers 

Differences in the procedures being used by the 
main NASA centers involved in the NSTS Program 
may reflect an imbalance between the authority of 
the centers and that of the NSTS Program Office. 
The Committee is concerned that such an imbalance 
can lead to serious problems in large programs 
where two or more centers have major roles in 
what must be a tightly integrated program, such 
as the NSTS and Space Station. Without strong, 
central program direction and integration, the suc- 
cess and safety of these complex programs can be 
placed in jeopardy. 

Recommendation (10d):

The Administrator should ensure that strong, 
central program direction and integration of all 
aspects of the STS are maintained via the NSTS 
Program Office. 

1.3.10.5 Use of Non-Destructive Evaluation 
Techniques 

Non-destructive evaluation (NDE) tests on the 
Solid Rocket Motor (SRM) are performed at the 
manufacturing plant. Subsequent transportation 
and assembly introduce a risk of debonding and
other damage which may not be apparent upon
visual inspection. No NDE is done on the SRMs 
in the “stacked” configuration at the launch facility. 

New NDE techniques now being developed have 
potential applicability to the STS. 

Recommendation (10e):

The Committee recommends that NASA apply 
all practicable NDE techniques to the SRM at the 
launch facility, at the highest possible level of 
assembly (e.g., SRMs in the “stacked” configura- 
tion), and emphasize development of improved 
NDE methods. 

1.3.11 Focus on Risk Management 

The current safety assessment processes used by 
NASA do not establish objectively the levels of the 
various risks associated with the failure modes and 
hazards. 

It is not reasonable to expect that NASA man- 
agement or its panels and boards can provide their 
own detailed assessments of the risks associated 
with failure modes and hazards presented to them 
for acceptance. 

Validation and certification test programs are 
not planned or evaluated as quantitative inputs to 
safety risk assessments. Neither are operating con- 
ditions and environmental constraints which may 
control the safety risks adequately defined and 
evaluated. 

In the Committee’s view, the lack of objective, 
measurable assessments in the above areas hinders 
the implementation of an effective risk management 
program, including the reduction or elimination of 
risks. 

Recommendations (11): 

The Committee recommends that NASA con- 
sider establishing a focused agency-wide Systems 
Safety Engineering (SSE) function, at both Head- 
quarters and the centers, which would: 

— be structured so as to be integrally involved 
in the entire set of design, development, validation, 
qualification, and certification activities; 

— provide a full systems approach to the contin- 
uous identification of safety risks (not just failure 
modes and hazards) and the objective (quantitative) 
evaluation of such safety risks; 

— provide the output of this function to the 
NASA Program Directors in support of their risk 
management; and 
— support the Program Directors by providing
assurance that their systems are ready for final 
safety certification to the risk levels established by 
the NASA Administrator. 

The Committee also recommends that the STS 
risk management program, based in part on the 
definition of the potential to reduce the level of 
risk developed by the system safety risk assessment, 
include a concerted effort to remove or reduce the 
risks. 

1.4 CLOSING REMARKS 

Although this report and its recommendations 
are directed to the NSTS Program, most of them 
are of broader applicability. It would be wise to 
consider the lessons learned here when structuring
a risk assessment and management system for other
programs which have similar attributes, such as 
the Space Station. The safety of other large systems 
involving highly complex technology, and requiring 
major participation by several NASA centers and 
prime contractors, could benefit from an integrated 
risk assessment and management program based 
on the current NASA procedures supplemented by 
those recommended in this report. For any new 
program, such as the Space Station, there is the 
opportunity to structure an optimum risk assess- 
ment and management program at the outset by 
assembling those elements of risk assessment and 
management which will be most effective in estab- 
lishing, monitoring, and controlling safety risks to 
accepted levels. (See Section 6.) 
2 Introduction


“Criticality Review and Hazard Analysis. NASA
and the primary Shuttle contractors should 
review all Criticality 1, 1R, 2, and 2R items 
and hazard analyses. This review should iden- 
tify those items that must be improved prior 
to flight to ensure mission success and flight 
safety. An Audit Panel, appointed by the 
National Research Council, should verify the 
adequacy of the effort and report directly to 
the Administrator of NASA.” 


2.1 PURPOSE OF STUDY 

The Space Shuttle Challenger disaster of January
28, 1986, stunned NASA and the entire nation. As
the shock of the accident began to subside, NASA 
initiated a wide range of actions designed to ensure 
greater safety in various aspects of the Shuttle 
system and an improved focus on safety throughout 
the National Space Transportation System (NSTS) 
Program. A number of these actions were prompted 
by recommendations of the Presidential Commis- 
sion on the Space Shuttle Challenger Accident (also 
known as the Rogers Commission). 

Recommendation III of the Presidential Com- 
mission (see box above) directed NASA to review 
certain safety-critical items on the Shuttle as well 
as the existing analyses of hazards that could affect 
Shuttle operations and system safety, and to identify 
needed improvements in the Shuttle system. It also 
recommended the establishment of an audit panel, 
under the auspices of the National Research Coun- 
cil (NRC), to monitor that review effort and verify 
its adequacy. At NASA’s request, the NRC formed 
the Committee on Shuttle Criticality Review and
Hazard Analysis Audit to conduct this audit. The
Committee consisted of 12 people with expertise 
in a range of relevant areas: space system devel- 
opment and operations, aircraft development and 
operations, propulsion systems, avionics, struc- 
tures, statistics, reliability and safety, and risk 
assessment and management of complex techno- 
logical systems. They were asked to evaluate 
NASA’s effort in response to the Rogers Commis- 
sion recommendation and to report their findings 
and recommendations directly to the NASA Ad- 
ministrator. 

See Appendix B for the full text of the pertinent 
establishing documents. 

2.2 STUDY APPROACH 

2.2.1 Interpretation of Task 

Following its charge from the Rogers Commis- 
sion and NASA, the Committee planned initially 
to focus its audit strictly on certain specific features 
of the NASA safety process: 

• the Critical Items List (CIL) and the NASA 
review of those Shuttle primary and backup 
units whose failure might result in loss of life, 
the Shuttle vehicle itself, or the mission (i.e., 
the Criticality 1, 1R, 2 and 2R items 4);

• the Failure Modes and Effects Analyses (FMEA) 
on which the criticality determinations are 
largely based; and 

• the hazard analyses and their review. 

(See Section 3 for a description of these activities 
and their interrelationships.) 

4 See Table 3-1 for definitions of Criticality levels. 
Early in its study, the Committee recognized that
to fulfill its charge to “verify the adequacy of the 
effort” it must broaden the scope of its audit to 
include an assessment, from a risk management 
point of view, of NASA’s overall process for 
identifying, assessing, reviewing, and implementing 
changes in the Space Shuttle system. That broader 
scope would include not only other safety analyses 
and functions, but also the relationship of safety 
elements and organizations to the continuing proc- 
ess of Space Shuttle design and engineering. (See 
Appendix B for the resulting Statement of Task.) 

Thus, in the context of evaluating NASA’s pro- 
cedures for detecting, assessing, and dealing with 
hazards and potential failure modes in the Shuttle 
system, the Committee would seek to determine: 

• What has NASA done in the past? 

• What is it doing differently now?

• How adequate are these procedures? 

• Where are the flaws in the process, if any? 

2.2.2 Plan and Structure 

The Committee began with a general review of 
NASA’s policies and procedures for reviewing safety- 
critical items and analyzing hazards. This process 
overview, provided in briefings by and discussions 
with NASA officials and managers of the NSTS 
Program and its component projects, provided not 
only a general overview but also the status of the 
reevaluation which NASA had undertaken of the 
FMEA/CIL and hazard analyses. The general re- 
view also included briefings and studies on the 
ways in which other organizations and industries
(e.g., U.S. Air Force, nuclear power, and commer- 
cial aviation) accomplish similar safety analyses 
and reviews. 

The Committee decided to conduct its audit of 
the reevaluation on several levels. First, it would 
conduct a detailed review of one or two major 
Space Transportation System (STS) elements 5, and
the reevaluation process and its results. The Space 
Shuttle Main Engine (SSME) and the Solid Rocket 
Booster/Solid Rocket Motor (SRB/SRM) were selected
for this audit, since the Committee felt that
the greatest hazards are in propulsion. During its
work, the Committee identified other areas of
concern which led to a detailed examination of a
number of different aspects of the STS safety-related
activities. Each of these audits was conducted
through a series of meetings with NASA
and contractor personnel on-site at contractor
facilities and NASA centers.

5 NASA terminology generally refers to the entire Space Shuttle as a
“system” composed of four major flight “elements”: Orbiter, Space
Shuttle Main Engines, Solid Rocket Boosters/Solid Rocket Motors,
and External Tank. Each of these elements is composed of major
systems which are, in turn, made up of subsystems, units, and
components or piece parts.

Concern about the potential weakness of NASA’s 
“top-down” analyses to complement the “bottom-up”
FMEA/CILs (which seemed to be the dominant
safety evaluation tool) led the Committee to initiate 
audits related to the integrated system safety as- 
sessments across all of the elements of the STS. 
For example, it examined interactions arising from 
the generation and distribution of electrical power 
and fresh water aboard the STS, and the generation 
and distribution of hydraulic power in the Orbiter 
and the SRB. This work is reflected particularly in 
Section 5.7 of this report. 

The 17-inch diameter fuel and oxidizer discon- 
nect valves between the Orbiter and the External 
Tank (ET) were selected for detailed examination
of the preparation and role of hazard analyses in
STS risk assessment to complement the broader,
more general treatment of this subject obtained in 
briefings, discussions, and written answers to Committee
questions. This audit contributed significantly
to Sections 5.3 and 5.11.

The Committee discovered early in its work that 
the large number of Criticality 1 and 1R items on
the STS are not ranked by priority of their importance
and that NASA did not appear to be making
much use of modern analytical techniques in quan- 
titatively assessing probabilities of failures and their 
effects, and levels of risk in the program. This led 
to a special investigation of the extent to which 
such techniques are used in the NSTS program, 
and of methods which might be of special value to 
the program. (See especially Sections 5.2 and 5.6, 
and Appendices D and E.) 
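
To illustrate what ranking critical items by priority could look like, the sketch below orders a few items by an estimated probability of failure per flight. It is a minimal Python sketch under stated assumptions: the item names and probability values are invented placeholders, not data from the NSTS Program, and a real assessment would also weigh the consequences and the uncertainty attached to each estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalItem:
    name: str          # hypothetical item label
    criticality: str   # "1" or "1R", as used in the CIL
    p_failure: float   # assumed probability of failure per flight


def rank_by_risk(items):
    """Return items ordered from highest to lowest assumed failure probability."""
    return sorted(items, key=lambda item: item.p_failure, reverse=True)


# Placeholder values chosen only to exercise the function.
ITEMS = [
    CriticalItem("item A", "1", 1e-4),
    CriticalItem("item B", "1R", 5e-3),
    CriticalItem("item C", "1", 2e-5),
]

for rank, item in enumerate(rank_by_risk(ITEMS), start=1):
    print(rank, item.name, item.criticality, item.p_failure)
```

Sorting on a single number is the simplest possible ordering; it is shown only to make the idea of a prioritized list concrete.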

Since the STS structure was excluded by NASA 
from the FMEA/CIL process, and since there were 
concerns about the actual margins of safety, the 
Committee examined in some detail the past history 
and current activity of NASA in this critical area 
(see Section 5.10.2). The safety/risk assessment for
Orbiter software also is handled in a very different 
manner than hardware (e.g., no FMEA/CIL). 
Therefore, it too was subjected to a special audit, 
the results of which are reflected primarily in 
Sections 5.8 and 5.10.3. 
Finally, because of significant problems in the
past, the Committee examined in some detail, from 
a safety standpoint, the history and current redesign 
of the Orbiter nose wheel steering system, and the 
main wheels and brakes. 

These more detailed audits of selected subsys- 
tems, when coupled with the broader investigations 
of the SSME and SRB elements and the STS as a 
whole, provided the basis for the Committee’s 
findings, conclusions, and recommendations in Sec- 
tion 5 and supporting material in Appendices D 
through F. The Committee did not examine the 
interfaces between the STS and its payloads to the 
extent that the members were comfortable in mak- 
ing any specific conclusions and recommendations 
beyond those for the NSTS Program in general. 

2.2.3 Meetings and Site Visits 

Apart from the meetings and site visits conducted 
by individual and groups of Committee members, 
the full Committee held a total of 12 meetings. 
Nine meetings were largely fact-finding with NASA 
and contractor personnel; three were devoted to 
formulating conclusions and recommendations, and 
preparation of this final NRC report (see Table 
2-1). The Committee met with a large number of 
NASA personnel representing Headquarters man- 
agement, as well as program and project manage- 
ment at all three of the NASA field centers having 
primary involvement in the NSTS Program. Safety, 
Reliability, and Quality Assurance (SR&QA) 
organizations 6 were heavily represented among 
those presenting briefings and working with the 
Committee. Prime contractors for STS elements, 
and contractors for several subsystems and STS 
integration activities were also extensively repre- 
sented, both at NASA centers and at their own 
facilities. In addition, independent contractors in- 
volved in the FMEA/CIL reevaluation were heard 
from. 

In addition to the meetings and site visits, input 
was provided by NASA in two other very important 
ways. First, two NASA liaison persons representing 
Headquarters management and the NSTS Program 
(SR&QA Office) facilitated the Committee’s audit 
and provided direct input on specific questions on
an ongoing basis. Secondly, a series of documents
were provided giving detailed answers to lists of
questions developed by the Committee on a wide
range of subjects. These “Q&A” documents were
supplemented by substantial reports from NASA
on certain points of concern.

6 As of September 1987, the NASA Headquarters organization is
called Safety, Reliability, Maintainability, and Quality Assurance
(SRM&QA), while the similar organizations at the NASA centers are
still named SR&QA. In this report, SR&QA also is used to refer
generically to this function.

It should be noted here that the Committee was 
at all times impressed and gratified by the excellent 
support that was consistently provided by NASA 
management and staff to accommodate the Com- 
mittee’s audit and its inquiries. 

2.2.4 Interim Reports of the Committee 

In accordance with its charge, the Committee 
issued two interim progress reports in the form of 
letters to the NASA Administrator (see Appendix 
C). The first letter report was dated January 13, 
1987, some four months after the Committee first 
met. Presented in person by Committee Chairman 
Alton D. Slay to the Administrator and his key 
deputies, it presented four specific suggestions for 
improvement in aspects of the FMEA/CIL and 
hazard analysis processes, based on the initial phase 
of the Committee’s audit. The Administrator dis- 
cussed these matters with Chairman Slay, and then 
responded formally to SCRHAAC on April 22, 
1987, to describe actions taken with regard to the 
Committee’s concerns. As following sections will 
detail, specific changes in procedure and approach 
have already been made in response to two of the 
four suggestions (see NASA response to the first 
letter report, in Appendix C). 

In addition, Committee Chairman Slay appeared 
before the House Subcommittee on Space Science 
and Applications (Committee on Science, Space 
and Technology) on April 29, 1987, to discuss the 
findings contained in the first letter report. 

The Committee’s second letter report was issued 
July 22, 1987, and was again delivered personally 
by the Chairman and discussed with the Admin- 
istrator. It summarized SCRHAAC’s continuing 
activities and findings, also commenting on the 
actions taken by NASA in response to the first 
letter report. In this second report, eight new topics 
were addressed, some of them expressing approval 
of particular aspects of the STS risk assessment 
and management process, and planned changes, 
and others highlighting areas of concern on the 
part of the Committee. 

Some of the concerns expressed in the interim
reports have been resolved since the reports were
presented; others remain at issue. All of the concerns
identified in those reports are discussed in
Section 5 of this report. It should be noted that
NASA’s safety process in general, and the current
reevaluation in particular, have been undergoing
considerable change following the Challenger accident
and during the Committee’s audit. Indeed,
some of the changes have resulted from the Committee’s
discussions with NASA officials and from
its interim reports. Thus, many of the subjects
covered by this report have been “moving targets”
that continued to change as this report was being
prepared. However, the Committee believes that
the report reflects the facts and circumstances as
of September 1987.

TABLE 2-1 Meetings of the Committee on Shuttle Criticality Review and Hazard Analysis Audit

(Each entry gives: meeting number, date, location, participants, purpose.)

1.  9/22-23/86, NRC, Washington, DC
    Participants: NASA Headquarters, JSC, MSFC & KSC staff; Boeing Comm'l Aircraft representatives
    Purpose: Process overview, Committee planning

2.  10/27-28/86, Rockwell STS Div., Rocketdyne Div., Los Angeles, CA
    Participants: Rockwell STS Div., Rocketdyne Div., NASA HQ, JSC, MSFC, USAF Space Div. and Aerospace Corp. staff
    Purpose: SSME, Orbiter FMEA/CIL & hazard analysis audit

3.  11/10/86, NRC, Washington, DC
    Participants: NASA Assoc. Admins. for Space Flight & SRM&QA, NSTS Program Manager
    Purpose: Discussion of concerns; draft first interim report

4.  12/15-16/86, NASA JSC, Houston
    Participants: NSTS and JSC personnel (including Mission Operations & Astronaut personnel)
    Purpose: Review STS risk management and operations

5.  1/14-16/87, MSFC, Huntsville, AL; KSC, FL
    Participants: MSFC and KSC leaders and staff related to STS
    Purpose: Overview of MSFC & KSC FMEA/CILs & hazard analyses

6.  2/10-11/87, NRC, Washington, DC
    Participants: MSFC & JSC independent contractor staff, Quant. Risk Assess. (QRA) consultants
    Purpose: QRA, independent contractor FMEA/CIL reviews

7.  3/18/87, Rocketdyne Div., Canoga Park, CA
    Participants: Rockwell STS Div., Rocketdyne Div., NASA HQ, JSC, and MSFC staff
    Purpose: SSME; STS integration activities

8.  4/24-25/87, NRC, Washington, DC
    Participants: NASA HQ & JSC NSTS personnel, NASA HQ SRM&QA personnel
    Purpose: SRM&QA status and functions; STS integration & software

9.  5/28-29/87, NRC, Washington, DC
    Participants: NSTS Dep. Dir., Operations; JSC, HQ personnel
    Purpose: STS oprns, payloads, PCASS, system engineering; draft second interim report

10. 7/13-14/87, NRC, Woods Hole, MA
    Participants: Executive session
    Purpose: Review & discuss information collected

11. 9/3-4/87, NRC, Washington, DC
    Participants: Executive session
    Purpose: Formulate conclusions, recommendations; review drafts

12. 10/12/87, NRC, Washington, DC
    Participants: Executive session
    Purpose: Review & approve final text

ACRONYMS

CIL      Critical Items List
FMEA     Failure Modes and Effects Analysis
HQ       Headquarters (of NASA)
JSC      Johnson Space Center
KSC      Kennedy Space Center
MSFC     Marshall Space Flight Center
NASA     National Aeronautics & Space Administration
NRC      National Research Council
NSTS     National Space Transportation System
PCASS    Program Compliance Assurance and Status System
QRA      Quantitative Risk Assessment
SRM&QA   Safety, Reliability, Maintainability & Quality Assurance
SSME     Space Shuttle Main Engine
STS      Space Transportation System
USAF     United States Air Force

2.3 ORGANIZATION OF THE REPORT 

Following this introduction is Section 3, which 
presents an overview of NASA’s safety process for 


the NSTS Program as the Committee understands 
it. That section is provided as a tutorial for those 
who may not be familiar with this complex process. 
Section 4 briefly describes the Committee’s con- 
ception of modern risk management, including the 
essential element of objective risk assessment, and 
contrasts it with NASA’s safety process in general 
terms. 

The heart of the report is Section 5, which 
presents discussion, findings, and recommendations 
regarding particular aspects of NASA’s STS safety 
assurance process. It comprises the results of the 
Committee’s audit. The section is divided into 11
subsections, each dealing with a different aspect of 
the process (with some encompassing related but 
distinct topics). 

Section 6 is a brief summary of the main “lessons 
learned” by SCRHAAC in the course of its audit. 
These lessons, derived from the STS review, are 
considered to be applicable to other large and
complex technological systems which, by their size 
and complexity, require the involvement of several 
major centers and organizations for their execution. 

Finally, a series of appendices are provided.
Some, like Appendix A (“Acronyms and Definitions”),
are intended as useful tools for the reader.
Others are provided as amplification or background 
on various subjects addressed in the report. See the 
Table of Contents for a complete listing. 
3 NASA’s Safety Process For The
National Space Transportation 
System Program 


Before entering into a discussion of the Com- 
mittee’s findings regarding various specific aspects 
of the process that NASA relies on to ensure the 
safety of the Space Transportation System (STS), 
it may be useful to provide a basic overview of the 
elements and purposes of that process. Readers 
who are already familiar with the structure and
purposes of NASA’s present safety process may
wish to skip over this “orientation” section and
begin reading at Section 4. 

The measures taken to ensure safety follow basic 
NASA policy issued at the Administrator level. The 
implementation of that policy is guided and over- 
seen by descending levels of management through- 
out NASA Headquarters and the NASA field cen- 
ters and their contractors involved in STS 
development and operation. Various organizations 
within NASA have different and overlapping sets 
of responsibilities with respect to safety of the STS. 
At the heart of the safety process is a set of analyses 
of the system configuration and function. NASA’s 
activities in the safety area since the Challenger 
(51-L) disaster occurred have centered on these
analyses and on the needed engineering changes in 
the STS system which the analyses have helped to 
identify. 

This section is intended to be only a factual 
description of NASA’s safety process, with empha- 
sis on policy and structure (as perceived by the 
Committee). The Committee’s analysis and comments
are presented beginning in Section 4.

3.1 POLICY ON SAFETY

NASA policy regarding safety is established by
the Administrator through NASA Policy Directive
(NPD) 1701.1, “Basic Policy on Safety.” The purpose
of this document is to prescribe “the basic
policy for planning, developing, conducting, and 
evaluating agency activities to ensure the highest 
practicable standards of safety in all NASA pro- 
grams.” The essence of the policy is to: 

“a. Avoid loss of life, injury of personnel, damage and 
property loss. 

“b. Instill a safety awareness in all NASA employees and 
contractors. 

“c. Assure that an organized and systematic approach is 
utilized to identify safety hazards and that safety is 
fully considered from conception to completion of all 
agency activities. 

“d. Review and evaluate plans, systems, and activities 
related to establishing and meeting safety requirements 
both by contractors and by NASA installations to 
ensure that desired objectives are effectively achieved.”

The accompanying NASA handbook (NHB 1700.1
[V1]) states that “... the steps necessary to achieve
safety of operations begin with initial planning and
extend through every facet of NASA’s activities.
Under this concept, every manager throughout the
organization is responsible for systematically iden- 
tifying risks, hazards, or unsafe situations or prac- 
tices, and for taking steps to assure adequate safety 
in the activities and products under his supervi- 
sion.” 

Out of this broad policy framework are derived 
the more specific safety requirements that are 
implemented in successively greater detail down 
through Headquarters, program and project or- 
ganizations at the NASA centers, and contractor 
organizations. 
3.2 MANAGEMENT STRUCTURE

3.2.1 Program Management 

The development and operation of the STS is 
carried out through a National Space Transpor- 
tation System (NSTS) Program. This Program draws 
on resources functionally located at three of the 
NASA field centers. Prior to the Challenger mission 
51-L the NSTS Program was managed out of 
Johnson Space Center (JSC), in Houston; JSC is 
also responsible for the Orbiter element of the STS 
as well as the integration of all STS elements. 
Marshall Space Flight Center (MSFC), in Alabama, 
is responsible for the propulsion elements of the 
STS: the Space Shuttle Main Engine (SSME), Solid 
Rocket Booster (SRB), which includes the Solid 
Rocket Motor (SRM), and External Tank (ET). 
Kennedy Space Center (KSC), in Florida, is responsible
for major ground support equipment (GSE), and 
launch and landing operations. 

After mission 51-L, the NSTS Program Director 
was brought to NASA Headquarters (Level I) to 
manage the program from a location closer to top 
agency officials and at a level which has oversight 
of all three field centers. The Deputy Director 
(Program) of the NSTS Program remains at JSC; 
the recently established position of Deputy Director
(Operations) is located at KSC. At each NASA
center there are Project Managers responsible for 
the particular elements and systems. These Project 
Managers, in a matrix organizational arrangement, 
report functionally to the NSTS Program Director 
as well as organizationally to the center manage- 
ment. Reporting to the Project Managers are var- 
ious Subsystem Managers who are directly respon- 
sible for the engineering effort on their subsystems. 
Thus, within the center organization there are 
engineers and other personnel supporting the NSTS 
Program. 

Management levels within the NSTS Program 
are referred to as “Level I, Level II,” and so on
according to the hierarchy shown in Figure 3-1. 
Each level of management has a specific scope of 
responsibility, as described in the figure. Basically, 
Level I is Headquarters, primarily concerned with 
policy and broad program formulation and man- 
agement; Level II is the major program manage- 
ment level; and Level III is the project management 
level. The Level I Program Director is at Head- 
quarters, and reports to the Associate Administra- 
tor for Space Flight. Level II for development resides 
at JSC (viz., the Deputy Director [Program]) and 
at KSC for operations (the Deputy Program Director
[Operations]), while Level III is dispersed across
all of the participating NASA centers. 
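
The division of change-control authority listed in Figure 3-1 can be illustrated with a short sketch. This is only an illustration under stated assumptions: the `ProposedChange` record, its field names, and the `controlling_level` function are hypothetical and do not come from NASA documentation; the dollar thresholds are the ones given in the figure. Level IV (control of detailed design within a project) is not modeled.

```python
from dataclasses import dataclass

# Thresholds taken from the Level I entry of Figure 3-1.
LEVEL_I_ANNUAL_LIMIT = 1_000_000   # dollars per year
LEVEL_I_TOTAL_LIMIT = 2_000_000    # dollars, total


@dataclass
class ProposedChange:
    """Hypothetical record describing a proposed program change."""
    cost_per_year: float
    total_cost: float
    exceeds_project_budget: bool
    impacts_level_i: bool    # affects Level I requirements or schedules
    impacts_level_ii: bool   # affects Level II requirements, interfaces, or schedules


def controlling_level(change: ProposedChange) -> str:
    """Return the lowest management level with authority over the change."""
    if (change.cost_per_year > LEVEL_I_ANNUAL_LIMIT
            or change.total_cost > LEVEL_I_TOTAL_LIMIT
            or change.impacts_level_i):
        return "Level I"
    if change.exceeds_project_budget or change.impacts_level_ii:
        return "Level II"
    # Changes within project-level budgets, schedules, and specifications.
    return "Level III"


example = ProposedChange(
    cost_per_year=100_000,
    total_cost=300_000,
    exceeds_project_budget=True,
    impacts_level_i=False,
    impacts_level_ii=False,
)
print(controlling_level(example))
```

The checks run from the highest level downward, so a change that meets any Level I criterion is never routed to a lower board.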



LEVEL I: 

TOP LEVEL PROGRAM REQUIREMENTS, 
BUDGETS AND SCHEDULES. CONTROL OF 
CHANGES ABOVE $1 MILLION/YEAR OR TWO 
MILLION TOTAL, OR THOSE IMPACTING LEVEL 
I REQUIREMENTS OR SCHEDULES. 

LEVEL II: 

MANAGEMENT AND INTEGRATION OF ALL 
ELEMENTS OF THE PROGRAM. INTEGRATED 
FLIGHT AND GROUND SYSTEM REQUIREMENTS, 
SCHEDULES AND BUDGETS; CONTROL OF 
PROJECT INTERFACES; CONTROL OF CHANGES 
EXCEEDING PROJECT BUDGETS, OR THOSE 
IMPACTING LEVEL II REQUIREMENTS, 
INTERFACES, OR SCHEDULES. 

LEVEL III: 

PROJECT ORIENTED FLIGHT AND GROUND 
SYSTEM REQUIREMENTS, SCHEDULES, AND 
BUDGETS; CONTROL OF CHANGES WITHIN 
PROJECT LEVEL BUDGETS, SCHEDULES, AND 
SPECIFICATIONS. 

LEVEL IV: 

DETAILED FLIGHT AND GROUND SYSTEM 
REQUIREMENTS WITHIN ASSIGNED PROJECT. 
CONTROL AND IMPLEMENTATION OF 
DETAILED DESIGN. 


FIGURE 3-1 National Space Transportation System Program management relationships (after NASA).

3.2.2 Review Boards

Each of the management levels has associated 
with it one or more boards or panels that review
and approve or disapprove the actions proposed
by technical and other groups at the levels below.
The most important of these boards are the two
Program Requirements Control Boards (PRCBs). 
One PRCB is at Level II and the other at Level I, 
chaired respectively by the NSTS Deputy Director 
(Program) and the NSTS Program Director. These 
boards meet together to review FMEA/CILs. The 
main Level III boards are the Configuration Control 
Boards (CCBs), one for each STS element and the 
two launch sites (KSC and Vandenberg AFB); each
of the CCBs is supported by a number of Config- 
uration Control Panels (CCPs). (See Figure 3-2.) 

Each of these boards and panels has controlling 
authority for “dispositioning” (deciding upon or 
recommending) proposed changes to its documen- 
tation, hardware, and software — to the extent that 
the change does not conflict with requirements, 
schedules, budgets, etc., established by a higher-level
board. Level II/I PRCB approval is required
for all changes to flight hardware after delivery to 
NASA and for all changes to flight hardware that 
interfaces with GSE. 

There are a considerable number of other Level 
II and III boards that are responsible for review of
specific technical and management aspects of STS
design, development, and operation. All of them
feed, ultimately, through the Level II/I PRCBs,
which are the highest boards for configuration 
control. These boards and their functions (some of 
which are shown in Figure 3-2) will be described 
further in Section 3.3, and from a different stand- 
point in Section 5.10.1. 

3.3 ORGANIZATIONAL ROLES 

As was noted in Section 3.1, in theory, safety in
all its forms is equally the responsibility of all 
NASA managers and workers, as well as those of 
their contractors. In practice, roles and responsi- 
bilities are necessarily defined and allocated across 
various functional organizations. Within the NSTS 
Program, these safety-related roles are shared by 
the engineering organizations in the project offices; 
the Safety, Reliability, Maintainability, and Quality 
Assurance (SRM&QA) organization at Headquar- 
ters and the corresponding SR&QA organizations 
at the centers; the NSTS Engineering Integration
Office; and, to a lesser extent, the operations
organizations (i.e., the Astronaut Office and Mis- 
sion Operations Directorate). 

3.3.1 Engineering Project Offices 

The engineering organization within each ele- 
ment project office at the centers is responsible to 
a Project Manager and the Program Director for 
the performance and reliability of hardware/software
systems they develop. Safety is thus an inherent
feature of the system design, development,
testing, and production processes. Since it is engi- 
neers who design the unit or system, test it, certify 
it for operation, and inspect it after flight, it is they 
who have the greatest ability to understand and 
anticipate the ways in which the unit or system
might fail. 

For that reason, NASA engineers have primary 
responsibility for carrying out the most technical 
of the safety analyses described in Section 3.4 (i.e., 
the Failure Modes and Effects Analysis [FMEA])
and for establishing the rationale for retaining 
critical items identified through the FMEA. They 
participate secondarily in other safety analysis 
efforts. However, few of the engineers have any 
formal grounding in safety engineering techniques 
and methodologies. 

3.3.2 Safety, Reliability, Maintainability, and Quality 
Assurance 

Safety, Reliability, and Quality Assurance 
(SR&QA) Offices (the maintainability function was
added at Headquarters in 1986) have long existed
in one form or another within the various NASA
centers as staff organizations reporting to the center 
director. (See Figure 3-3, for example.) The cor- 
responding Headquarters organization has existed 
as a policy-setting group reporting, until 1986, to 
the NASA Chief Engineer. 

Center SR&QA staff are detailed to programs
such as the NSTS Program, where they develop
functional units of staff dedicated to various aspects
of Safety, Reliability, and Quality Assurance.* Their
role is to provide oversight of the engineering design
and development activities, and to advise the Project
Manager and the various configuration control
boards on the safety and other relevant aspects of
systems under review. They are also responsible
for keeping records on problems and anomalies
encountered in the development and operation of
the STS.

* The center SR&QA organizations have, as of the time of writing,
not adopted the “M” in their organization name. We have elected to
adhere to current NASA practice to avoid confusion.

FIGURE 3-3 Organization of NASA Johnson Space Center (NASA).

SR&QA, through its Safety Divisions, has pri- 
mary responsibility for conducting hazard analyses 
of the STS (see Section 3.4.2 for a description). 
This is one of the most important safety-related 
analyses conducted on the STS, in many ways 
complementing the FMEA. 

In the wake of the Challenger accident, the 
functions and authority of SR&QA were expanded 
in scope, and the Headquarters organization was 
restructured. A new position of Associate Admin- 
istrator for SRM&QA was established, with appeal 
rights to the Administrator of NASA on any de- 
cision relevant to the safety of the STS and its 
crew. The new Associate Administrator intends to 
establish the SRM&QA function as an effective 
check and balance to the overall NASA operation, 
one that will provide a “second-look assessment” 
of the entire process from design through opera- 
tions. Figure 3-4 depicts the new SRM&QA or- 
ganization at Headquarters. 

3.3.3 Engineering Integration Office 

The NSTS Engineering Integration Office is located
at JSC, where it handles certain special aspects
of STS design and development that are crucial to
the safe functioning of the overall system. These 
include: systems integration and interface design 
between the different STS elements, analyses of 
integrated structural loads and thermal effects, 
software requirements and configuration control, 
and ground systems and operations requirements. 
Shuttle avionics and ascent flight systems — two 
systems involving electronics and software func- 
tions which cut across various STS elements — are 
also among the responsibilities of this office. 

The organization of the office is shown in Figure 
3-5. Note that the figure identifies a separate review 
structure for systems integration and software. The 
Systems Integration Review (SIR) Board is a Level 
II board that supports the Level II and I PRCBs in 
all the integration areas, including ascent and entry, 
flight control, and thermal design. The Shuttle 
Avionics Software Control Board (SASCB) is the 
controlling authority for avionics software. Addi- 
tionally, a Mission Integration Control Board 
(MICB), shown in Figure 3-2, is the controlling 
authority for changes to delegated mission integra- 
tion requirements that do not affect other Level II 
requirements, budgets, or schedules. 

The Engineering Integration Office is also responsible
for carrying out a series of Element
Interface Functional Analyses (EIFA), described in
Section 3.4.3 below.

FIGURE 3-4 Organization of the new office of Safety, Reliability, Maintainability, and Quality Assurance at NASA
Headquarters (NASA).

3.4 SAFETY ANALYSES 

3.4.1 The Failure Modes and Effects Analysis and 
Critical Items List 

At the heart of NASA’s effort to ensure reliability 
of the Shuttle system is the Failure Modes and 
Effects Analysis. FMEAs are performed on all STS 
flight hardware as well as Ground Support Equip- 
ment which interfaces with flight hardware at the 
launch sites to identify hardware items that are 
critical to the performance and safety of the vehicle 
and the mission, and to identify items that do not 
meet design requirements. (NASA does not perform 
FMEAs on software; also excluded from the FMEA 
by definition are STS primary structure and, orig- 
inally, pressure vessels.) This analysis, carried out 
by the element contractor, begins with an identi- 
fication of the functional units of each system and 
a determination of the potential modes of failure 
for each unit. Each possible failure mode is then 
analyzed to determine the resulting performance 
of the system and to ascertain the worst-case effect 
that could result from a failure in that mode. All 
the identified items are then categorized according 
to the worst-case effect of the failure on the crew, 
the vehicle, and the mission. 

Table 3-1 shows the FMEA/CIL criticality classifications,
which are based on severity of effect.
Items in the top four categories — Criticality 1, 1R, 
2, and 2R — comprise a Critical Items List (CIL). 
Essentially, this is a listing of all hardware items 
and their failure modes which do not meet certain 
design and reliability requirements (related to safety) 
set for the Shuttle system by Level I management. 
Those requirements (specified in JSC 07700, Vol. 
I, Appendix A, para. 2.8) are as follows: 

• “Redundancy requirements for all flight vehicle
subsystems . . . [with specific exceptions]
. . . shall be established on an individual basis,
but shall be no less than fail-safe.

• “Redundant systems shall be designed so that 
their operational status can be verified during 
ground turnaround and to the maximum ex- 
tent possible while in flight.” 

Therefore, in addition to single-point failures, the 
CIL also includes items that could fail in one mode 
and result in loss of the capability of redundant 
(backup) systems, items whose status is not readily 
detectable in flight, and redundant systems in which 
a single failure under certain conditions may result 
in loss of the total system capability. 
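
The categorization and listing steps described above can be sketched in a few lines. This is a minimal illustration, not NASA's procedure: the data layout, the function names, and the sample entries are hypothetical. It assumes that the row order of Table 3-1 (1, 1R, 2, 2R, 3) is the severity ranking, and it leaves out the GSE-only categories (1S, 2S) and the redundancy screens.

```python
# Categories from Table 3-1, most severe first (row order taken as ranking).
SEVERITY_ORDER = ["1", "1R", "2", "2R", "3"]

# The four categories that the text says make up the Critical Items List.
CIL_CATEGORIES = {"1", "1R", "2", "2R"}


def worst_case(categories):
    """Return the most severe category among those given."""
    return min(categories, key=SEVERITY_ORDER.index)


def build_cil(fmea):
    """Collect (item, failure mode, category) entries that belong on the CIL.

    `fmea` maps each hardware item to a dict of failure mode -> category.
    """
    cil = []
    for item, modes in fmea.items():
        for mode, category in modes.items():
            if category in CIL_CATEGORIES:
                cil.append((item, mode, category))
    return cil


# Hypothetical FMEA results for two items.
sample_fmea = {
    "landing gear strut": {"structural failure": "1", "slow fluid leak": "3"},
    "cabin fan": {"motor stall": "2R"},
}

for entry in build_cil(sample_fmea):
    print(entry)
print(worst_case(sample_fmea["landing gear strut"].values()))
```

Keeping the list at the level of item and failure mode, rather than item alone, mirrors the text's description of the CIL as a listing of hardware items and their failure modes.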

Critical items with these failure modes must be 
subjected to design improvements or to corrective 
action to meet the fail-safe and redundancy re- 
quirements, before the Shuttle can fly with them 
present. If that is not feasible, a waiver request
must be submitted to NASA management to present
the rationale for retaining an item that does not
meet the requirements. Types of data included in
this “retention rationale” include design, test, and
inspection data, failure history, and operational
experience. Figure 3-6 shows an example of a CIL
document, including the retention rationale.

TABLE 3-1 FMEA/CIL Criticality Classification

Criticality Category    Potential Effect of Failure

1                       Loss of life or vehicle
1R                      Redundant hardware element, failure of which could
                        cause loss of life or vehicle
2                       Loss of mission
2R                      Redundant hardware element, failure of which could
                        cause loss of mission
3                       All others

For Ground Support Equipment only:

1S                      Failure of a safety or hazard monitoring system to
                        detect, combat, or operate when required and could
                        allow loss of life or vehicle
2S                      Loss of vehicle system

An approved waiver must support the decision 
to accept the risk represented by the critical item 
and ensure that maintenance, test, or inspection 
procedures will minimize the potential for the 
failure to occur. Figure 3-7 depicts the review and 
approval process for critical items. Note that the 
key approval reviews are done by the CCB and 
PRCB review boards described in Section 3.2.2. 
After the PRCB meets, a directive is issued that 
documents items for which waivers have been 
granted and lists actions assigned by the Board. 
Each critical item, along with its approved waiver, 
is maintained by the NSTS Program, and any 
subsequent changes affecting the CIL must be 
approved by the NSTS Program Director. 

The FMEA/CIL was originally conceived as a 
design tool, used to ensure the early identification 
and disposal of critical failure modes, as well as to 
support other reviews of the STS design. Since 
mission 51-L it is now also an operational and 
management tool, used for problem analysis, to 
assess the efficacy of corrective actions, to identify 
maintenance checkout requirements and inspection 
points, and to reflect trends in failure history. 

3.4.2 Hazard Analysis 

Hazard analysis is another analytical tool used 
to identify and, if possible, resolve hazardous 
conditions that could develop while operating and 
maintaining STS hardware and software. Hazard
identification is performed collectively by the NSTS
engineering, safety, and operations organizations. 
Sources of information used to identify hazards 
include the FMEA/CIL, as well as various design 
reviews, safety analyses, crew procedures devel- 
opment, flight anomaly reports, and other sources. 
Hazard analyses thus consider not only the failures 
identified in the FMEA process, but also other 
potential threats posed by the environment, crew/ 
machine interfaces, and mission activities. There 
are several different types of hazard analyses, as 
listed in Table 3-2. A typical Hazard (analysis) 
Report (HR) is shown as Figure 3-8. 

Identified hazards and their causes are analyzed 
by Safety Division staff of the SR&QA offices at 
the NASA centers (and their contractors) to find 
ways to eliminate or control the hazard. A hazard 
is said to be “eliminated” when its source has been 
removed. A “controlled hazard” is one that has 
effectively been controlled by a design change, the 
addition of safety or warning devices, procedural 
changes, or operational constraints. Any hazard 
that cannot feasibly be eliminated or controlled by 
these means is termed an “accepted risk”, and 
requires review and approval by Level III and II 
management boards and their chairmen. SR&QA 
maintains a closed-loop tracking system for hazard 
documentation, resolution, and approval. The basic 
steps in hazard processing and review are depicted 
in Figure 3-9 and Figure 3-10.
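
The three hazard dispositions defined above can be expressed as a small decision function. The sketch below is illustrative only: the enum, the function names, and the string labels for control measures are hypothetical, and the approval step is modeled only for accepted risks because that is the only case the text spells out.

```python
from enum import Enum


class HazardStatus(Enum):
    ELIMINATED = "eliminated"
    CONTROLLED = "controlled"
    ACCEPTED_RISK = "accepted risk"


# Means of control named in the text (labels are illustrative).
CONTROL_MEANS = {
    "design change",
    "safety or warning device",
    "procedural change",
    "operational constraint",
}


def classify_hazard(source_removed, controls):
    """Classify a hazard from whether its source is gone and which controls apply."""
    if source_removed:
        return HazardStatus.ELIMINATED
    if any(control in CONTROL_MEANS for control in controls):
        return HazardStatus.CONTROLLED
    # Neither eliminated nor controlled by the listed means.
    return HazardStatus.ACCEPTED_RISK


def boards_for_accepted_risk(status):
    """Return the management levels whose boards must approve an accepted risk.

    Review steps for eliminated or controlled hazards are not modeled here.
    """
    if status is HazardStatus.ACCEPTED_RISK:
        return ["Level III", "Level II"]
    return []


status = classify_hazard(source_removed=False, controls=[])
print(status.value, boards_for_accepted_risk(status))
```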

Indicated in both of the latter figures is a Mission 
Safety Assessment (MSA). This is a report, prepared 
by the Safety Division for each STS flight mission, 
which provides an integrated and comprehensive 
assessment of all activities and hazards associated 
with a mission, including turnaround activities. It 
also provides a way to identify and “baseline”
hazards (i.e., to establish their “normal,” or
accepted, state or level) for future flights.

SHUTTLE CRITICAL ITEMS LIST - ORBITER

SUBSYSTEM:   LANDING DECELERATION    FMEA NO: 02-1-001-1    REV: 02/09/82
ASSEMBLY:    MAIN LANDING GEAR       ABORT:                 CRIT. FUNC: 1
P/N RI:      MC621-0011                                     CRIT. HDW:  1
P/N VENDOR:  1170100 MENASCO         VEHICLE:      102  099  103  104
QUANTITY:    2                       EFFECTIVITY:  X    X    X    X
             LEFT HAND               PHASE(S):     PL LO OO DO X LS
             RIGHT HAND

REDUNDANCY SCREEN: A-N/A B-N/A C-N/A

PREPARED BY:         APPROVED BY:    APPROVED BY (NASA):
DES  L L FHDCES      DES             SSM
REL  A L DOENER      REL             REL

ITEM: MLG STRUT
MLG SHOCK STRUT INNER AND OUTER CYLINDER AND LOAD CARRYING MEMBERS.

FUNCTION:
MLG LOAD CARRYING MEMBERS CYLINDER - DAMPER, WHERE A PASSAGE OF
HYDRAULIC FLUID THROUGH AN ORIFICE ABSORBS THE ENERGY OF IMPACT AND
WHERE DRY NITROGEN IS USED AS THE ELASTIC MEDIUM TO RESTORE THE
UNSPRUNG PARTS TO THEIR EXTENDED POSITION.

FAILURE MODE: STRUCTURAL FAILURE

CAUSE(S):
STRESS CORROSION. PIECE-PART STRUCTURAL FAILURE. OVERLOAD.

EFFECT(S) ON (A) SUBSYSTEM (B) INTERFACES (C) MISSION (D) CREW/VEHICLE:
(A) LOSS OF SUBSYSTEM FUNCTION. (B) NONE. (C) NONE. (D) PROBABLE
LOSS OF VEHICLE IF MAIN STRUT FAILS ON LANDING.

DISPOSITION & RATIONALE (A) DESIGN (B) TEST (C) INSPECTION (D) FAILURE HISTORY:
(A) UNDER WORST CASE LOADING (FLAT STRUT) THE STRUT IS CAPABLE OF
WITHSTANDING ONE LANDING AT THE NORMAL LANDING DESIGN GROSS WEIGHT OF
207,000 LBS. AND SINK SPEED OF 9.6 FEET PER SECOND WITH CORRESPONDING
LANDING ROLLOUT AND BRAKING CONDITIONS, WITH NO YIELDING OF THE
STRUCTURAL MEMBERS. (B) ACCEPTANCE INCLUDES VERIFICATION THAT
CERTIFIED MATERIALS AND PROCESSES WERE USED. CERTIFICATION INCLUDES A
FATIGUE LOAD TEST SPECTRUM (REF MC62-0011 TABLES 10-11) REPRESENTING THE
EQUIVALENT LOADING FOR THE LIFE OF EACH LANDING GEAR WITH A SCATTER
FACTOR OF 4.0. THE STATIC LOAD TESTS INCLUDED A TAXI BUMP (65K
PAYLOAD), VEHICLE WEIGHT 227 KIPS, AND A RIGHT TURN, WHICH IS THE WORST
CASE CONDITIONS WITHOUT FAILURE. (C) DURING TURNAROUND - VISUALLY INSPECT
FOR DAMAGE. USE NDE TO SUPPORT SUSPECT AREAS. AT MANUFACTURER - RAW
MATERIAL VERIFIED - VISUAL INSP./ID PERFORMED - PARTS PROTECTION, COATING
AND PLATING PROCESSES VERIF. BY INSPECTION. - MANUF., INSTL. AND ASSY.
OPERATIONS VERIF. BY SHOP TRAVELER MIPS - CORROSION PROTECTION PROVISIONS
VERIF. NDE OF SURFACE AND SUB-SURFACE DEFECTS VERIF. BY INSPECTION.
PROPERLY MONITORED HANDLING AND STORAGE ENVIRONMENT VERIFIED. MATL. AND
EQUIPMENT CONFORMANCE TO CONTRACT REQMTS. VERIFIED BY INSP. - FINDINGS
VERIFIED BY AUDIT 9-25-78. (D) DURING DROP TEST PROGRAM, THE OUTER
GLAND NUT FAILED. MENASCO REDESIGNED AND CHANGED FROM ALUMINUM TO STEEL
MATL. THE SNUBBER RING P/N 1170134-1 WAS REDESIGNED. UPPER BEARING
1170107-1 WAS REPLACED BY A SOLID ALUMINUM-BRONZE BEARING.

FIGURE 3-6 An example of a Critical Items List document (NASA).

3.4.3 Element Interface Functional Analysis 

Provision is made in NASA’s risk management 
process for checking cross-element interface failure 
modes and effects by a number of means. One 
method used is the Element Interface Functional 
Analysis, prepared by the NSTS Engineering Integration
Office with the support of Rockwell International.
EIFAs are analyses of various functional
failure modes that can occur at element-to-element
interfaces as a result of a hardware failure in either
element. There are three EIFAs: Orbiter/ET,
Orbiter/SSME, and Orbiter/SRB-ET. (A fourth EIFA,
on ground/flight systems, is now being generated.)

The purpose of these analyses is to correlate 
element hardware failures with failure modes at 
the element interface to determine the effect on the 
mission, vehicle, or crew safety. EIFAs also look 
for failure propagation across interfaces. The EIFA 
activity helps to ensure that FMEA items are 
correctly classified as to their criticality. 

3.4.4 Other Analyses 

Providing basic input to the hazard analysis is a 
diverse group of safety analyses. NHB 5300.4 (1D-2)
describes these analyses as follows:

“Safety analyses are performed at the integrated and element
(STS) levels and down to the component level to assure
identification of hazardous conditions, hazard causes, hazard
effects, hazard levels, corrective actions, and rationale for
hazard closure.”

FIGURE 3-7 Review and approval process for STS critical items (NASA NSTS).

TABLE 3-2 Types of Hazard Analyses

Type of Analysis            Program Phase                  Why Used

Preliminary Hazard          Concept/design and             Allows top level hazard definition by generic
Analyses                    development                    hazard and lends itself to expansion as the
                                                           program progresses

Fault Tree Analyses         Concept/design and             Allows in-depth analysis of selected critical
                            development/operations         areas and relationships among events.

Sneak Analysis              Design and development         Allows identification of latent nonfailure
                            phase (when detailed           conditions that may allow undesired conditions
                            design available)/operations   or prevent desired conditions

Software Hazard Analysis    Design and development         Allows independent verification that software
                            phase/operations               code implements approved requirement

Operations Hazard           Design and development         Allows identification of hazardous conditions
Analysis                    phase/operations               during operations caused by such things as
                                                           out-of-sequence operation, omitted steps, and
                                                           interaction of elements

Mission Level Hazard        Design and development         Allows detailed analysis of mission events
Analysis                    phase/operations               considering hardware, crew, ground operations,
                                                           and software interactions

Mission Safety Assessment   Design and development         Allows assessment of previously conducted
                            phase/operations               analyses for completeness and accuracy,
                                                           provides analyses and provides visibility of
                                                           hazards by mission phase and event.

(Source: NASA JSC)

An important subset of safety analyses are the
systems safety analyses, defined as follows (in NHB 
1700.1 (V3), System Safety):

“Systems safety analyses are performed for the purpose of
identifying hazards and establishing risk levels ... in support
of this concept the analyses perform five basic functions:

“a. Provide the foundation for the development of safety
criteria and requirements.

“b. Determine both whether and how the safety criteria
and requirements provided to engineering have been
included in the design(s).

“c. Determine whether the safety criteria and requirements
created for that design have provided for adequate
safety for the system.

“d. Provide part of the means for meeting pre-established
safety goals.

“e. Provide a means of demonstrating that safety goals
have been met.”

Two other important safety analyses are the 
Integrated Hazard Analysis (IHA) and Critical 
Functions Assessment (CFA). The NSTS Engineer- 
ing Integration Office, with the support of Rockwell 
International (the integration support contractor) 
produces an IHA when a potential risk situation 
or unsafe condition is perceived, the resolution of 
which involves two or more STS elements. These
analyses are reviewed by the Systems Integration
Review (SIR) Board, described earlier.

The CFA, a one-time effort completed in 1978,
examined critical functions during each mission 
phase and identified hardware and software changes 
which would improve safety. The CFA included 
certain multiple and cascading failure combina- 
tions; it is currently being reexamined by Rockwell 
International to verify the results of the initial 
assessment and provide an update to the current 
STS configuration. 

3.4.5 Overall Scope of Analyses 

The various analysis techniques employed by 
NASA are intended to provide an all-encompassing 
approach to ensuring the design reliability and 
safety of the STS. Some of the techniques, princi- 
pally the hazard analyses and EIFA, tend to be 
“top-down” approaches that examine certain cross- 
systems causes and effects. Others, such as FMEA/ 
CIL, are narrower “bottom-up” analyses that pur- 
sue a specific event to its conclusion — but only 
with respect to the piece of hardware involved. In 
a briefing to the Committee, Rockwell International 
presented its view of this interaction, summarized 
in Figure 3-11.

The FMEA/CIL, EIFA, and other safety analyses 
feed into the various hazard analyses in a one-way 
flow culminating in the Mission Safety Assessment.
As a practical matter (as discussed in Sections 5.1
and 5.3) the FMEA/CIL, with its retention rationale,
appears to be the dominant analysis, on which
the waiver and some of the engineering change
decisions are primarily based.

FIGURE 3-8 Excerpt from a sample Space Shuttle preliminary hazard analysis report (NASA).

FIGURE 3-10 Hazard analysis review process (NASA JSC).

FIGURE 3-11 Interaction of top-down and bottom-up analysis techniques used in STS reliability and safety
assessments (Rockwell STS Division).

TABLE 3-3 Critical Item Review Teams

Shuttle Element             Prime Contractor               Independent Review Contractor

Orbiter (JSC)               Rockwell International,        McDonnell Douglas Astronautics Co.,
                            STS Division                   Houston Division

External Tank (MSFC)        Martin Marietta, Michoud       Rockwell International, Space
                            Aerospace Div.                 Transportation Systems Division

Solid Rocket Motor (MSFC)   Morton Thiokol, Inc.,          Martin Marietta, Denver Aerospace
                            Wasatch Operations             Division

Solid Rocket Booster        United Technologies Corp.,     Martin Marietta, Denver Aerospace
(MSFC)                      United Space Boosters, Inc.    Division

Space Shuttle Main Engine   Rockwell International,        Martin Marietta, Denver Aerospace
(MSFC)                      Rocketdyne Division            Division

(Source: NASA)

3.5 POST-51L REEVALUATION/REVIEW

3.5.1 NASA Management Directives 

In March 1986, soon after the Challenger acci- 
dent, direction was sent out from the Associate 
Administrator for Space Flight and the NSTS Pro- 
gram Director to the NSTS Project Offices to 
reevaluate (“re-review”) the FMEAs on all critical 
items on the STS. The Program Director described 
the purpose of the reevaluation as: “. . . to affirm 
the completeness and accuracy of the FMEA/CIL 
for the current National STS design.”* Following
reevaluation of the FMEA, each Criticality 1 and 
1R item, along with any new items, or items for 
which the reevaluation had led to a change in 
classification, was to be resubmitted for review and 
approval of the waiver permitting the item to be 
flown aboard the STS. Authority for approval of 
these waivers resides at the Level I PRCB, with the 
NSTS Program Director having final sign-off au- 
thority. 
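
The resubmission rule in the directive can be written as a simple predicate. This is a sketch under assumptions: the `ReevaluatedItem` record and its field names are hypothetical, and a “new item” is represented here by a missing prior classification, which is one reading of the text rather than a documented convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReevaluatedItem:
    """Hypothetical record of one item after FMEA reevaluation."""
    name: str
    old_criticality: Optional[str]  # None means the item is newly identified
    new_criticality: str


def must_resubmit_for_waiver(item: ReevaluatedItem) -> bool:
    """True if the item falls in any group the directive names for resubmission."""
    is_new = item.old_criticality is None
    reclassified = (not is_new) and item.old_criticality != item.new_criticality
    is_crit_1_or_1r = item.new_criticality in {"1", "1R"}
    return is_crit_1_or_1r or is_new or reclassified


items = [
    ReevaluatedItem("item A", old_criticality="1", new_criticality="1"),
    ReevaluatedItem("item B", old_criticality="3", new_criticality="2"),
    ReevaluatedItem("item C", old_criticality=None, new_criticality="2R"),
    ReevaluatedItem("item D", old_criticality="2", new_criticality="2"),
]
for item in items:
    print(item.name, must_resubmit_for_waiver(item))
```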

Those items not revalidated by the review were 
required to be redesigned, certified, and qualified 
for flight. In addition to the FMEA/CIL reevalua- 
tion, the directives stipulated that the hazard analy- 
ses and EIFAs also be reviewed. 


* Memorandum of March Id, 1986 . 


3.5.2 Process 

FMEA/CIL. Each NSTS project and its prime 
contractor carried out the FMEA/CIL reevaluation, 
usually doing two separate reviews. In addition, 
independent contractors not otherwise involved in 
working on that element were selected to conduct 
parallel reviews of the FMEA/CIL for each element 
and to report the results of their assessments to 
NASA’s review team. These independent reviews
emphasized any analysis results that differed from 
those identified by NASA or the element prime 
contractor. The FMEA/CIL review participants are 
listed in Table 3-3. 

The processing flow for the reevaluation initially 
varied somewhat from center to center, but was 
essentially like that shown in Figure 3-12 (from 
JSC). During the reevaluation, special effort has 
been directed to identifying design enhancements, 
operational and procedural checkout changes, or 
software additions that reduce the criticality and/ 
or minimize the chance that the potential failure 
mode will occur. 

The main difference between the re-review and 
the “normal review process” is the conduct of the 
independent reviews. Another significant difference 
is that the ground rules for determining Criticality 1
status were changed: FMEAs are now carried down 
to the individual component level (even where 
multiple identical components are involved), and 
pressure vessels (formerly excluded) are now in- 
cluded. These and other changes in procedure are 
specified in a new document, NSTS 22206, “In- 
structions for Preparation of Failure Modes and 
Effects Analysis and Critical Items List,” which
was issued in October 1986 to standardize the
process across the program.

FIGURE 3-12 Typical processing flow for the current reevaluation of STS FMEA/CILs (NASA NSTS).

Hazard Analysis. A similar review of all ele- 
ment and integrated system-level hazard analyses 
is being undertaken in response to the Challenger 
accident. As in the case of FMEA/CIL, each project 
office, its prime contractor, and the independent 
contractor are evaluating all hazard analyses and 
Hazard Reports to verify their completeness and 
accuracy. Figure 3-13 illustrates the current review 
process. 

Each hazard analysis assessment is being con- 
ducted in accordance with the guidance provided 
in a new document, NSTS 22254, “Methodology
for Conduct of NSTS Hazard Analyses.” This 
document defines the policy and procedures re- 
quired for preparing hazard analyses, Hazard Re- 
ports, and Mission Safety Assessments. 

The current review consists of a technical safety 
evaluation of the source material used for all 
analyses, studies, and investigations conducted from 
the beginning of STS flight. Each subsystem as- 
sessment is expected to ensure that all hazards have 
been identified, that dispositions are accurate, and 
that identified risks are acceptable. 

3.5.3 Relation to Engineering Redesign Activity 

Since the mission 51-L accident, a substantial 
number of engineering changes have been under- 
taken to improve Shuttle safety prior to resumption 
of flight. Shortly after the Challenger accident, 
groups representing various organizational elements
of NASA (design centers, Astronaut Office,
etc.) presented the NSTS Program Director with
lists of items which they considered as needing 
attention. All were Criticality 1 or 1R items. From 
these lists, a special Level II senior management 
PRCB known as the System Design Review Board 
recommended the selection of 90 items (consisting 
of hardware, software, and procedures) to undergo 
redesign, test, or analysis before the next flight of 
the Shuttle. Other items were categorized as near-term
and “opportunity” actions. Since that time,
the number of mandatory next-flight changes across
the STS has grown to 159.

The redesign activity has, for the most part, 
preceded the FMEA/CIL and hazard analysis re- 
evaluations. Relatively few of the early items iden- 
tified for next-flight change derived from the re- 
evaluation activity. However, as the reevaluations 
proceeded they did disclose a number of items 
which are being worked before the next flight. 
FMEA/CILs and hazard analyses are being gener- 
ated for all STS elements and modifications. The 
PRCB constitutes itself as the System Design Review 
Board to review all waiver recommendations on 
critical items. 

3.5.4 Relation to Flight Readiness Process 

The results of the various safety-related analyses 
feed into the flight review and readiness processes. 
By the time of the Design Certification Review 
(DCR), three months before launch, all FMEA/CIL 
waiver decisions, Hazard Reports, and the Mission 
Safety Assessment are available for review by the 
relevant readiness review boards. 



FIGURE 3-13 Steps in the current hazard analysis reevaluation process (NASA).

3.5.5 Data Input and Output

Among the most important types of data for use 
in developing and updating the CIL retention 
rationale and conducting hazard analyses is feed- 
back from actual use of the hardware. STS equip- 
ment tests, preflight checkout, postflight inspec- 
tions, and inflight operational experience and data 
are all crucial sources of this type of data. NASA 
uses a number of special reports and reporting 
systems to collect and integrate such data. They 
include the following, whose names are self-ex- 
planatory: 

• Problem Reporting and Corrective Action 
(PRACA) System 

• Problem Reports (PRs) 

• Discrepancy Reports (DRs) [for software] 

• Unsatisfactory Condition Reports (UCRs) 

• Failure Reports 

The PRACA system is a large, distributed data 
base (one for each STS element and one for KSC 
ground support equipment) that contains all of the 
reports listed above, along with data on corrective 
actions taken. PRACA is the basis for many design 
changes. Problems found in a postflight assessment 
are logged into the PRACA system at the design 
center for that element, and all problems are tracked 
by JSC/NSTS via a flight anomaly report, or Failure 
Report. The Failure Report is cross-correlated with 
the FMEA/CIL number. 

Steps are being taken to ensure that the results 
of safety analyses are available to NASA managers 
in a more thorough and timely fashion. For ex- 
ample, NASA is setting up a closed-loop accounting 
and review system, by which all Criticality 1, 1R,
and 1S items are being tied to problem reports and
their resolutions. This new System Integrity Assurance
Program (SIAP), being developed under the
NSTS Engineering Integration Office, is intended 
to ensure that STS flight and ground systems retain 
their design performance, reliability, and safety. It 
draws on the FMEA/CIL, hazard analyses, and 
other existing safety analysis systems. 
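
The closed-loop accounting idea described above can be illustrated with a minimal Python sketch. It is not the SIAP or PRACA design, which the text does not specify; the record fields (`item_id`, `criticality`, `resolved`) and the sample identifiers are hypothetical, chosen only to show how critical items might be tied to problem reports so that unresolved ones surface for review.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemReport:
    report_id: str          # hypothetical identifier
    item_id: str            # critical item the report is tied to
    resolved: bool = False  # True once the corrective action is closed out


@dataclass
class CriticalItem:
    item_id: str      # hypothetical FMEA/CIL number
    criticality: str  # "1", "1R", or "1S"
    reports: list = field(default_factory=list)


def link_reports(items, reports):
    """Attach each problem report to its critical item; return orphans."""
    by_id = {item.item_id: item for item in items}
    orphans = []
    for report in reports:
        item = by_id.get(report.item_id)
        if item is None:
            orphans.append(report)  # report cites no known critical item
        else:
            item.reports.append(report)
    return orphans


def open_items(items):
    """Critical items that still have at least one unresolved report."""
    return [i for i in items if any(not r.resolved for r in i.reports)]


# Hypothetical sample records
items = [CriticalItem("CIL-001", "1"), CriticalItem("CIL-002", "1R")]
reports = [
    ProblemReport("PR-10", "CIL-001", resolved=True),
    ProblemReport("PR-11", "CIL-002"),
    ProblemReport("PR-12", "CIL-999"),
]
orphans = link_reports(items, reports)
still_open = [i.item_id for i in open_items(items)]
```

In this sketch the loop is "closed" in two directions: every report must resolve to a known critical item (otherwise it is returned as an orphan), and every critical item can be queried for reports that remain unresolved.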

A major component of the SIAP is its Program 
Compliance Assurance Status System (PCASS) — 
essentially a computer-based management infor- 
mation system. The PCASS will serve as a central 
data base integrating a number of existing infor- 
mation systems and sources across the NSTS. For 
example, the PRACA will be a part of it, facilitating 
the reduction and presentation of data on flight 
anomalies. It will provide in near real-time, to users 
such as the participants in Flight Readiness Re- 
views, an integrated view of the status of problems 
with the STS, including trends, anomalies and 
deviations, and closure information. One of the 
major advantages of PCASS is that it will give 
SR&QA staff an easy route of access into the entire 
system of data bases dealing with the STS. Even- 
tually, it will provide automated information on 
critical item status and hazard data, with a com- 
puterized FMEA planned as one of the inputs. 

NASA Headquarters SRM&QA is also planning 
an extensive system for the documentation, re- 
porting, review, and assessment of safety infor- 
mation. The NASA Safety Information System 
(NSIS) and the Shuttle Hazards Information Man- 
agement System (SHIMS) — an STS hazards data 
base — are two examples. 

These input and output mechanisms provide the 
essential connectivity of the safety analyses to the 
continuing development, improvement, and oper- 
ation of the STS within the NSTS Program. 

4 Risk Assessment and Risk
Management: The Committee’s View


4.1 GENERAL CONCEPT 

Almost lost in the strong public reaction to the 
Challenger failure was the inescapable fact that 
major advances in mankind's capability to explore 
and operate in space — indeed, even in routine 
atmospheric flight — will only be accomplished in 
the face of risk. The risks of space flight must be 
accepted by those who are asked to participate in 
each flight as well as by those who are responsible
for the program. The Committee believes that the 
basis for NASA's acceptance of those risks should 
stem as much as possible from rationally derived 
criteria. This acceptance also should depend very 
heavily on the quality of the methodology and the 
degree of objectivity by which the risks are determined,
as well as the rigor by which the risks are
controlled (i.e., managed). 

The Committee began its audit activities by 
focusing specifically on the FMEA, the CIL, and 
the hazard analysis process. However, very early 
in the data gathering phase it became clear that 
NASA's processes for analyzing failure modes, 
effects, and hazards could only be understood and 
evaluated intelligently when viewed as elements of 
an overall program of risk assessment and risk 
management. In the Committee's view, any such 
program should include the following basic ele- 
ments: 

1. A comprehensive method for identifying po- 
tential failure modes and hazards associated 
with the system. 

2. A specific, quantitative methodology for iden- 
tifying and assessing (or estimating) the safety 
risks of the system. 


3. A risk management process by which the 
safety risks can be brought to levels or values 
that are acceptable to the final approval
authority. Risk management includes: 

— establishment of acceptable risk levels; 

— institution of changes in system design or 
operational methods to achieve such risk 
levels; 

— system validation and certification; and 

— system quality assurance. 

In this usage, we define a “safety risk” as the
probability (likelihood or chance) of suffering a 
particular consequence of a failure mode, mishap, 
or hazard. For a large, complex system such as the 
STS, there is a set of system risks each of which is 
comprised of many contributing risks. Thus, we 
use the plural “safety risks” of the system, since
one may choose to manage these risks to different 
levels. 
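
As a hedged illustration of how a system risk can be built up from contributing risks, the Python sketch below combines per-event probabilities into the probability that at least one of them occurs. It assumes the contributing events are statistically independent, which is a simplification introduced here and not a claim made in this report; the function name and the numbers are hypothetical.

```python
import math


def system_risk(contributing):
    """Probability that at least one contributing event occurs.

    Assumes the contributing events are statistically independent,
    an illustrative simplification rather than a property of the STS.
    """
    for p in contributing:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # Complement of the probability that none of the events occurs
    return 1.0 - math.prod(1.0 - p for p in contributing)


# Hypothetical per-flight probabilities for three contributing risks
loss_of_vehicle = system_risk([0.001, 0.002, 0.0005])
```

Keeping one such calculation per consequence (loss of life, loss of vehicle, loss of mission, and so on) is what makes it possible to manage each system risk to its own level.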

There are actually two major functions present 
in the listing above. Risk assessment is comprised 
of the first two elements, identification and assess- 
ment of both the failure modes and hazards, and 
the safety risks associated with them. Risk assess- 
ment is or should be a staff function, the results 
of which are provided as input to management. 
Risk management, on the other hand (the third
element above), must primarily be a line manage- 
ment function. Within NASA, SRM&QA at Head- 
quarters and SR&QA at the centers are staff 
organizations. The Associate Administrator for 
SRM&QA reports to the NASA Administrator. 
Line management authority for NSTS extends from 
the Administrator to the Level I Associate Administrator
for Space Flight to the NSTS Program
Director and thence through the Level II Program
Office to the Level III project managers.

The concept of risk assessment and risk man- 
agement is employed very explicitly within some 
private industries and public enterprises engaged 
in the engineering development of complex systems. 
The nuclear power industry is one such, and the 
commercial aerospace industry is another. Within 
the USAF Systems Command (including the Space 
Division, which develops military launch vehicles 
and spacecraft), risk assessment consists of a wide 
range of qualitative and quantitative tools, includ- 
ing the FMEA and hazard analysis. Risk manage- 
ment is viewed as a formal process involving the 
establishment, assessment, and control of risk to 
predetermined acceptable levels. 

Figure 4-1 illustrates a generic type of program
planning and tracking chart that is used in risk 
management by the USAF. Levels of risk in the 
system, as evaluated by a specific risk assessment 
methodology, are plotted against time (and the 
cost) to correct the problems contributing to risk. 
In this generic example, actual risk lags and exceeds 
the planned levels of risk for each category of risk, 
and throughout most of the program. The planned 
risk presents a target toward which the system risk 
is actively managed. The risk levels assessed at the 
conceptual design stage must eventually be evolved, 
through engineering, down to levels acceptable to 
the approval authority (i.e., high level, program 
line management). This is accomplished through a 
“systems safety engineering” function that is an 
integral part of the engineering design and devel- 
opment process from its inception. 
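
The planned-versus-actual comparison that such a tracking chart supports can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the USAF method: the phase names, the 0-to-1 risk scale, and the function name are assumptions made for the example.

```python
def risk_shortfalls(planned, actual):
    """Return (phase, excess) pairs where assessed risk exceeds the plan.

    `planned` and `actual` map a program phase name to a risk level
    produced by whatever assessment methodology is in use. Phases with
    no assessment yet are skipped.
    """
    shortfalls = []
    for phase, target in planned.items():
        assessed = actual.get(phase)
        if assessed is not None and assessed > target:
            shortfalls.append((phase, assessed - target))
    return shortfalls


# Hypothetical risk levels on an arbitrary 0-1 scale
planned = {"concept": 0.50, "design": 0.30, "test": 0.10, "operations": 0.05}
actual = {"concept": 0.50, "design": 0.40, "test": 0.20}
behind = [phase for phase, _ in risk_shortfalls(planned, actual)]
```

The comparison is only meaningful when both curves come from the same assessment methodology, which is why the text ties the chart to a specific one.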

4.2 NASA’S PROCESS: OVERALL 
COMMENTS 

The fundamental view of risk assessment and
management discussed above took shape over the
first few months of the Committee’s activities. It
formed a framework within which the Committee
could conduct the subsequent stages of the audit 
and more confidently evaluate NASA’s STS safety 
program — of which the FMEAs, CILs, and hazard 
analyses are only a few important parts. Much of 
the remainder of this report reflects the results of 
our inquiry into specific aspects of the ways in 
which NASA assesses and manages risks in the 
NSTS program. But we believe it is important,
before plunging into specifics, to provide a sense
of the “big picture” within which the Committee 
conducted its audit, and to give a general assessment 
of how NASA’s current process (as described in 
Section 3) relates to that picture. 


4.2.1 NASA Risk Assessment 

NASA defines risk as: “the chance (qualitative) 
of loss of personnel capability, loss of system, or 
damage to or loss of equipment or property.” 
[NHB 5300.4 (1D-2), p. a-4]

To identify potential failure modes and hazards, 
NASA uses input from many different sources: 
analyses, data gathering processes, design reviews, 
etc. Figure 4-2, obtained from the SR&QA Office 
at JSC, lists most of these sources for the NSTS. 
(However, the Committee is not aware of any 
FMEAs or hazard analyses being conducted on 
software.) If employed rigorously, these tools pro- 
vide a good basis for achieving element 1 of the 
three specified in Section 4.1. However, this list of
sources might more appropriately be titled “Iden- 
tify Potential Failures and Hazards,” because most 
of the activities listed do not deal with risk. For 
example, the failure modes analysis identifies pos- 
sible hardware failure modes, but usually says little 
about the risk associated with each of them. When 
the effects analysis is added in, then part of the 
input needed to establish risk has been gained, but 
still nothing is inferred about the probability of 
occurrence of either the failure itself or the various 
possible effects that might result. A similar situation 
occurs in the identification of hazards. 

One can categorize failure modes on the basis 
of the consequences of their worst-case effects, as 
is done in a very rough way in the Critical Items
List, for failure modes whose worst-case effects
lead (for example) to loss of life or vehicle. Such a 
categorization is useful for calling urgent attention 
to certain failure modes and their attendant haz- 
ards. Nevertheless, the listing of such items does 
not establish their contribution to the various risks 
of the system. In the NASA safety process, each 
item on the CIL has a retention rationale written 
for it. These retention rationale statements usually 
contain information which could, if used properly,
contribute to a process for estimating the associated 
risk. However, the rationales appear to be used 
strictly as arguments for a waiver of the NSTS 
requirement that no single-point Criticality 1 or
1R failure modes be present when a mission is
launched (see Sections 3.4.1 and 5.1).

FIGURE 4-1 Conceptual diagram of risk management involving iterative steps taken to achieve specified levels of acceptable risk (horizontal axis: time, in program phases and $).

FIGURE 4-2 Techniques for the identification of potential sources of risk in the NSTS Program (after NASA JSC SR&QA).

Similarly, in NASA’s hazard analysis process, 
hazards are categorized as to level and status. 
Hazards are defined as either critical or cata- 
strophic, depending on whether or not there is time 
for any possible emergency action to be taken. 
Each “closed” hazard is categorized as being elim- 
inated, controlled, or an “accepted risk.” Ration- 
ales are written to justify accepting the uncontrolled 
hazards; many times the same rationale is employed 
that was used for retaining the critical failure modes 
(see Section 5.3 for elaboration). However, as in 
the case of the CILs, these justifications do not 
establish the risk levels of the hazards. Thus, 
although the term “risk assessment” is used in 
many different ways and places in NASA docu- 
ments and presentations, the Committee found that 
nowhere was the total activity described that is 
needed to accomplish element 2 in Section 4.1 
above (i.e., a quantitative methodology for assess- 
ing safety risks). 


In NASA’s definition of risk (above), the word 
“chance” is used as the measure (or basis of 
comparison) of the risk. The definition clearly 
implies evaluation of a set of risks based on the 
chance of occurrence of each of the various con- 
sequences described. However, NASA acknowl- 
edges, and our reviews have confirmed, that these 
“chances” are not formally or specifically esti- 
mated; nor are they documented. Rather, STS risks 
are assessed based on subjective judgments and the 
approval of qualitative rationales by various board 
and panel chairmen, and Level II and I authorities, 
as described in Section 3. However, many quanti- 
tative engineering analyses and test data relevant 
to risk assessment are available and often are used 
in arriving at what are finally qualitative subjective 
judgments. With such a non-specific (i.e., non-value-based)
risk acceptance process there is little
basis for making objective comparisons of the 
several major risk categories associated with the 
STS, nor for carrying out risk evaluations by 
independent agencies. Neither can one systematically
evaluate the results of efforts to reduce the
risk of the various possible losses. Without more 
objective, quantifiable measures of relative risk it 
is not clear how NASA can expect to implement a 
truly effective risk management program. 

4.2.2 NASA Risk Management 

The various NASA documents identified in Sections
3.1 and 3.4, with some of their key provisions
noted, basically describe a framework within which 
to operate an effective risk management program. 
At the core of such a program is the idea of risk 
management through the control of hazards. Re- 
sidual hazards (risks) that cannot be designed away 
would be controlled at least to levels consistent 
with program objectives and cost constraints. The 
definition and analysis of hazards and levels of risk 
associated with a system and its operation was to 
be performed within a system safety function. Since 
the effective level of hazard control was not always 
expected to be perfect, a “residual hazard risk
analysis” would be performed to provide the re- 
tention rationale for accepting such hazards and 
for continuing to operate (perhaps with con- 
straints). 

In parallel with and providing inputs to this 
system safety function is a reliability activity. This 
function was to be basically concerned with estab- 
lishing a data base for selection of components 
which would meet allocated failure probability
requirements; performing failure mode and effects
analyses; establishing redundancy criteria and con- 
figuration definitions, maintainability criteria, and 
life limits; and preparing critical items lists con- 
taining items with single-point failure modes which 
could cause catastrophic results. 

A third element in the overall safety and risk 
management program is quality assurance. This 
function, as defined by NASA, would be responsible 
for assuring that the hardware and software pro- 
duced for the system was produced in a controlled 
way and met all requirements of the quality control 
criteria documents. This assurance role also in- 
cludes supervision of personnel certification and 
establishment of non-destructive testing methods 
to detect flaws in components and non-conforming 
materials. 

These functions provide the basic staff capability 
which line management can bring to bear on the 
management of risk in the NSTS Program. NASA’s 
own explicit view of risk management for the NSTS
was described to the Committee at JSC. It is
conceived to be a synthesis of activities in four 
broad categories: 

• Programmatic 

• Engineering/development 

• Mission operations 

• Product assurance 

As depicted in Figure 4-3, activities in all cate- 
gories are conducted throughout all phases of the 
NSTS Program, from concept definition to flight 
operations. The risk management process is said 
to be characterized by top-down direction and 
control, with “bottom-up” response and accountability
from the staff organizations and line management
at the NASA centers. The process of risk
assessment and management is described as one of 
“independent but integrated participation” by Pro- 
gram management, design/development (project 
engineering), operations (Astronaut Office and 
Mission Operations Directorate), and SR&QA. 
These terms are key: the degree of independence 
and integration of organizations and functions 
within the overall process comprise a major, re- 
curring theme of the discussion presented in the 
following Section 5. 

4.3 SUMMARY 

The basic organizational elements are in place 
within NASA for assessing and managing risk; 
however, there is a need for a change in the scope 
of functions and the way that they are carried out.
Certain shortcomings in process and methodology
exist which are discussed in the following section.
In particular, there is a fundamental problem in 
the nature of and the methods used to develop the 
overall assessments on which NASA line management
bases its decisions about how to reduce and
control risk in the STS. Also, it appears to the
Committee that there is no clear, formal, and
rigorous view among NASA line managers — at
least on any consistent basis — of the nature and 
goals of risk management. 

To reiterate what was said earlier, the Committee
believes that risk management for any system
involving complex engineering must be the responsibility
of line management — i.e., (in the case of
the NSTS) the system Program Manager, the Associate
Administrator for Space Flight and, ultimately,
the Administrator of NASA. Only this
program management, not the safety organizations, 
can make judicious use of the means available to 
achieve the operational goals while evolving the 
safety risks down to acceptable levels, as described 
earlier. The safety organizations at NASA centers 
and Headquarters are staff organizations — i.e., they
can and should be responsible for providing the 
assessments of the system’s risks. They should also
be responsible for assuring that the activities associated
with controlling the risks to the levels
assessed have been carried out and documented. 
Safety organizations cannot, however, assure safe 
operation; they can only assure that the safety risks
have been evaluated by approved, proper, rigorous, 
quantitative, and objective methods, and that the 
system configuration and its operation are being 
controlled to those risk levels. 

5 National Space Transportation
System Risk Assessment and Risk
Management: Discussion and
Recommendations


5.1 CRITICAL ITEMS LIST RETENTION
RATIONALE REVIEW AND WAIVER 
PROCESS 


The Committee views the NASA critical 
items list (CIL) waiver decision making process 
as being subjective, with little in the way of 
formal and consistent criteria for approval or 
rejection of waivers. Waiver decisions appear 
to be driven almost exclusively by the design- 
based FMEA/CIL retention rationale, rather 
than being based on an integrated assessment 
of all inputs to risk management. The retention 
rationales appear biased toward proving that 
the design is “safe,” sometimes ignoring sig- 
nificant evidence to the contrary. 

Although the Safety, Reliability, and Quality 
Assurance (SR&QA) organizations of NASA 
collect, verify, and transmit all data related to 
FMEA/CIL and hazard analysis results, the 
Committee has not found an independent, 
detailed analysis or assessment of the CIL 
retention rationale which considers all inputs 
to the risk assessment process. 

As set forth in the NASA documents identified 
in Section 3.1, both the performance of the Failure 
Modes and Effects Analysis (FMEA) and the iden- 
tification of critical items are intended to be carried 
out under the aegis of the reliability function. In 
principle, the FMEA should be both a design tool 
to provide an impetus for design change, and a 
tool for the evaluation of the final configuration in 
order to define the necessary control points on the
hardware. The identified critical items would require
supporting retention rationale and waivers
as appropriate in order to be included in the overall 
as-flown system configuration. How this retention 
rationale was to be generated, who developed it 
and who evaluated it against what safety criteria 
became crucial questions for the Committee’s review
of the whole process.

According to prescribed procedures, the hazard 
analyses being performed by the safety function of 
SR&QA, and the FMEA and CIL identification 
performed by the reliability function, were to come
together in the generation of Mission Safety As- 
sessment (MSA) reports which would contain 
analyses and justification of the retention rationale 
for the critical items and their associated “hazards”, 
as well as a safety-risk assessment of the resulting 
units, subsystems, and systems. The hazard analysis 
and Mission Safety Assessment parts of this overall 
safety and risk assessment process as it was sup- 
posed to be done prior to 1986 are shown in Figure 
5-1, obtained from JSC’s SR&QA. 

As Figure 5-1 indicates, according to specified 
NASA procedure the CIL retention rationale is to 
be used as one of many inputs to the more comprehensive
hazard analysis. In reality, however, the
hazard analysis is often simply a derivative of the 
CIL and its retention rationale, and is not used as 
a major basis for waiver decisions. Examination 
by the Committee showed that often these retention 
rationales were simply discussions of the hard- 
ware’s specifications, design, and testing. They were 
generated primarily by the functional development 
engineers responsible for the design. They are 
intended to be justifications, and do not, in our
view, provide a true assessment of the risk of the
hazards.

FIGURE 5-1 ... the Challenger accident (NASA JSC SR&QA)

Sometimes the rationale appears to be simply a 
collection of judgments that a design should be 
safe, emphasizing positive evidence at the expense 
of the negative, and thus does not give a balanced
picture of the risk involved. For example, the CIL 
retention rationale of December 1982, for the Solid 
Rocket Motor (SRM) indicated in support of re- 
tention that: there had been no failures in three 
qualification, five development, and ten flight mo- 
tors; there had been no leakage in eight static firings 
and five STS flights; 1076 Titan III joints (presum- 
ably of similar design) were tested successfully; etc. 
Missing from the retention rationale was, among 
other points, any discussion of the dissimilarities 
between the SRM and Titan III (e.g., insulation
design and combustion pressure on the O-ring);
the O-ring erosion observed in the Titan III program
and on the second STS flight; a failure during an 
SRM burst test; and, since the rationale was not 
updated, all of the O-ring anomalies seen after 
December 1982. Furthermore, in many cases we 
reviewed: 

• No specific methodology or criteria are estab- 
lished against which these justifications can 
be measured. 

• The true margins against the failure modes 
often are not defined or explicitly validated. 

• The probability of the failure mode is never 
established quantitatively. 

• Design “fixes” are accepted without being
analyzed and compared with the configuration 
they are replacing on the basis of relative risk. 

The point is worth reiterating: The retention ra- 
tionale is used to justify accepting the design “as 
is”; Committee audits of the review process dis- 
covered little emphasis on creative ways to elimi- 
nate potential failure modes. 
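
To illustrate why a record of "no failures" does not by itself establish a low failure probability, the Python sketch below computes the standard upper confidence bound for a run of failure-free trials. It assumes the trials are independent and share one failure probability, which is a modeling assumption introduced here; the function name is hypothetical, and the counts are the ones quoted from the December 1982 rationale above.

```python
def zero_failure_upper_bound(trials, confidence=0.95):
    """Upper confidence bound on failure probability after `trials`
    independent trials with zero observed failures.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


# 3 qualification + 5 development + 10 flight motors, no failures
bound_motors = zero_failure_upper_bound(18)
# 1076 Titan III joints, treated as comparable only for illustration
bound_joints = zero_failure_upper_bound(1076)
```

Under these assumptions, eighteen failure-free motors are still consistent with a per-motor failure probability of roughly 15 percent at 95 percent confidence. The much tighter bound from the Titan III joints applies to the SRM only to the extent that the two designs are actually similar, which is the point the retention rationale left unexamined.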

Since 51-L, there has been a major increase in 
the attention and resources given to STS SR&QA 
and risk assessment and management functions at 
all levels of NASA and its contractors. In 1986, 
NASA appointed an Associate Administrator at 
Headquarters for Safety, Reliability, Maintainabil- 
ity, and Quality Assurance (SRM&QA) and charged 
him with establishing a NASA-wide safety and risk 
management program. To implement this program, 
policy directives are being developed relating to
various procedures and operational requirements.
Specific instructions and methodologies to be used 
in the conduct of various analyses and assessments, 
such as hazard analyses, are being developed. 
Independent institutional assessments and audits 
will be made of SR&QA activities and technical 
effectiveness at each NASA center. 

Some important elements of this revamped NASA 
safety program — including hazard analysis and 
mission safety assessment — are depicted in Figure 
5-2, which was obtained from the JSC SR&QA 
organization in May 1987. Several things shown 
in the figure should be noted. First, there is now a 
specific new set of NSTS instructions to all con- 
tractors and NASA organizations for conducting 
hazard analyses, and for preparing FMEAs and 
CILs for the NSTS (these new instructions affect 
the activities in the boxes in Figure 5-2 marked *). 
Second, it can be seen that the FMEA/CIL documents
are intended to be one of many inputs into
the hazard analysis and Hazard Report, which in 
turn are shown as an input into the Mission Safety 
Assessment. 

However, since (as discussed in Section 4.2) the 
Hazard Reports do not provide a comprehensive 
risk assessment, nor are they even required to be 
an independent evaluation of the retention rationale 
stated in the CILs, the Committee believes that 
NASA plans — at least for the near term — to con- 
tinue using the retention rationale of the CILs 
directly and individually as the basis for Criticality 
1 and 1R waiver justifications to Levels II and I. 
We have indicated this by adding the Criticality 1 
and 1R waiver path within the dashed lines on the
left side of Figure 5-2. The current plan is to take 
the critical item waiver requests to the PRCB and 
Level I via a data package prepared by JSC SR&QA. 
It is our impression, however, that most of the 
arguments in this data package will still basically 
be those contained in the original CIL retention 
rationale. Thus, we see too little in the way of an 
independent detailed analysis, critique, or assess- 
ment of the risk inherent in Engineering’s rationale. 

Since mid-1986, NASA and its contractors have
been performing a massive rework of all STS
program FMEAs, updating the resulting CILs, and
reviewing all prior hazard analyses. This new FMEA/CIL effort
has had value in identifying new failure modes that
were missed earlier or introduced through past
changes, and those resulting from new changes
made mandatory before next flight. However, the
new NSTS instructions for preparing FMEA/CILs
(NSTS 22206) have also resulted in a large increase
in the number of Criticality 1 and 1R items. The
Committee believes this new complexity will pose
additional severe problems for both the mechanics
and credibility of the CIL and waiver processes.

FIGURE 5-2 NASA JSC safety analysis, hazard reports, and safety assessment process in 1987, as modified
by the Committee (adapted from NASA JSC SR&QA).


The strong dependence on the CIL retention 
rationales in waiver decisions makes it critical that 
they be comprehensive and up to date. It is not 
clear to the Committee whether, in the pre-51-L
environment, changes in the STS configuration or
the operational experience base led directly and
surely to review and appropriate updating of the 
relevant CIL retention rationale. In the wake of 
the 51-L accident, the NSTS program issued a 
document (NSTS 22206) which is intended to 
strengthen the process for updating the retention 
rationale. Once a retention rationale has been 
accepted and a waiver granted for a critical item, 
any changes to the item itself, the FMEA, or the 
CIL that could affect the retention rationale mean 
that the CIL must be resubmitted to the Level II/I 
PRCB for its approval (NSTS 22206, p. 2-7,
para. 2.2.6). Any change, whether it be to the test
environment, level, procedures, methods, or fre- 
quency, is to be reflected in changes to the retention 
rationale. If crew procedures are changed to reduce 
risk, corresponding changes are also to be made in 
the retention rationale. 

The question is whether this updating is con- 
ducted regularly and in a consistently rigorous 
fashion. Although this policy is new and may not 
yet have been fully imposed in all quarters, NASA 
and contractor personnel interviewed by the Com- 
mittee seemed variously uncertain about or una- 
ware of these requirements and how they are met. 
Updating the retention rationale seems to many to 
be considered a routine bookkeeping chore, of 
secondary importance, yet these rationales are the 
primary basis for granting waivers. 

During its audit the Committee developed a 
concern that the FMEA and associated retention 
rationale on a given critical item may sometimes 
fail to provide data in various important categories 
of information, such as the effects of environmental 
parameters. The lack of data in a certain case may 
or may not be significant with respect to the threat 
that item represents. Yet the absence of such data, 
even though it resulted in uncertainty, in the past 
has sometimes had the effect of bolstering the 
rationale for retention and providing unwarranted 
confidence in readiness reviews. This problem was 
especially in evidence with Mission 51-L. Data 
suggesting that temperature was a factor in the 
erosion of the O-rings did exist, but (according to 
the Rogers Commission) the relevant analyses ap- 
parently were considered to be inconclusive by 
those responsible, and these data did not appear 
in the retention rationale. Thus, the rationale im- 
plied that there were no data to suggest that 
temperature was a problem. Strengthening and 
closing the problem reporting loop since the accident
may well reduce the likelihood of similar
future occurrences. Still, we note that the “negative
answer” indicates uncertainty about the issue at 
hand. If the uncertainty is crucial to the decision 
process, then it implies the need for more experi- 
ments, tests or analyses to reduce the uncertainty. 
(Appendix E includes an analysis of the O-ring 
temperature effect and the uncertainty implied by 
extrapolation to low temperatures.) 

Thus, the Committee’s central concerns here are 
the reliance on and quality of the retention ration- 
ale, and the fact that we can perceive no docu- 
mented, objective criteria for approving or rejecting 
proposed waivers. CIL waiver decision making 
appears to be subjective, with no consistent, formal 
basis for approval or rejection of waivers. All items 
are considered and discussed at length during the 
CCB and PRCB reviews. It appears that, if no 
action item is generated as a result of the review, 
the critical item waiver is approved. There was no 
formal “approved or disapproved” step in meetings 
audited by the Committee, although we are in- 
formed that such approvals do appear in the 
minutes of the meetings. NASA managers empha- 
size that Level III engineers and their “Level IV” 
contractors are accorded a high level of responsi- 
bility and accountability throughout the program, 
and that their opinions and analyses are the real 
bases for making retention decisions; these engi- 
neers bear the burden of proving that the rationale 
is strong enough to justify retention and waiver of 
the item. 

However, the Committee believes that engineer- 
ing judgment on these matters is not enough. Such 
judgment is crucial, but it is often too susceptible 
to vagaries of attention, knowledge, opinion, and 
extraneous pressures to be the sole foundation for 
decision making. We are concerned that, for all 
the reasons discussed above, without professional, 
detailed evaluation against specific criteria for re- 
ducing risk (not just review by panels and boards), 
the retention rationales can be misleading or even 
incorrect regarding the true causes and probabilities 
of the failure modes for which retention waivers 
are being requested (see discussion of probabilistic 
risk assessment in Section 5.6). 

Recommendations (1): 

The Committee recommends that NASA estab- 
lish an integrated review process which provides a 
comprehensive risk assessment and an independent 
evaluation of the rationale justifying the retention 
of Criticality 1/1R and 2/2R items. This integrated
review should include detailed consideration of the
results of hazard analyses and all other inputs to
the risk assessment process, in addition to the
FMEA/CIL retention rationale. Further, the review
process should assure that the waivers and sup- 
porting analyses fully reflect current data and 
designs. Finally, NASA should develop formal, 
objective criteria for approving or rejecting critical 
item waivers. 

5.2 CRITICAL ITEMS LIST 
PRIORITIZATION AND DISPOSITION 


At present, in NASA instructions all Criticality
1 and 1R items are formally treated
equally, even though many differ substantially
from each other in terms of the probability of
failure or malperformance, and in terms of the
potential for the worst-case effects postulated 
in the FMEA to be seen if the particular failure 
occurs. 

The large number of Criticality 1 and 1R
items at the time of the 51-L accident has since
been substantially increased due to changes in 
ground rules for classification and the complete 
reevaluation of the entire STS. 

The Committee believes that giving equal 
management attention to all Criticality 1 and
1R potential failures could be detrimental to 
safety if, as is the case, some are extremely 
unlikely to occur, or if the probability is very 
low that the postulated worst-case conse- 
quences of the failures will result. Treating all 
such items equally will necessarily detract from 
the attention senior management can give to 
the most likely and most threatening failure 
modes. 


Critical items in the Shuttle system are catego- 
rized according to the consequences of worst-case 
failure of that item. However, it has been the case 
that within each criticality category no further 
ranking is formally made. In practice, managers 
do sometimes discriminate within a category, e.g., 
in their decisions regarding those STS items which 
should be fixed prior to next flight. Prior to the 
51-L accident there were already 2369 Criticality
1 and 1R items (the most critical) present in the
Shuttle system. There has been a substantial increase
in the number of such items, now estimated
by NASA to be 4686, of which 2148 have been 
approved by the PRCB (Director, JSC/SR&QA, 
personal communication, November 10, 1987). 
This increase resulted from the reevaluation of the 
entire Space Shuttle system and the new ground 
rules specified for the preparation of FMEAs — e.g., 
the carrying of analyses down to the individual 
component level (even where multiple, identical 
components are involved) and the inclusion of 
pressure vessels which were formerly excluded (see 
Section 3.5.2). To take just one example, the 
number of Criticality 1 and 1R items in the SSME 
turbomachinery rose from 8 to 67 under the new 
ground rules. In view of this problem, NASA is 
now taking steps to prioritize the most critical 
items and will reevaluate the current scheme for 
defining levels of criticality. 

Initially, the reassessment process seemed to the 
Committee to be too heavily focused on Level I.
The presence of a very large number of Criticality
1 and 1R items (even admitting that many are
clustered with identical items) obviously places a
heavy demand on the time and attention of key 
NASA decision makers and could prevent their 
penetrating deeply enough into the analyses sur- 
rounding each item to make a valid decision on all 
of them. We were concerned not only about the 
workload placed on Level I management, but also 
about the danger that crucial technical details might 
be lost or obscured as the rationale for retention
was presented at successively higher levels. Al- 
though the same information is presented at the 
Level II and I PRCBs, it seemed entirely possible 
that technical debates occurring at lower levels 
might not be adequately relayed to Level I. 

A post-51-L organizational change that shifted
the Level II NSTS Program Director at JSC to Level 
I at Headquarters has alleviated these concerns to 
some extent. NASA recognized that the waiver 
decision-making flow was not ideal — especially 
from Level II to Level I. Consequently, the Level I 
NSTS Director (who also chairs the Level I PRCB) 
now participates in the Level II reviews as a basis 
for sign-off at Level I. Thus, there is now a more
direct “hand-off” of concerns and rationales from
Level III to Level I, via Level II. Nevertheless, the
process still places a heavy workload on Level I, 
and there is still a danger that important technical 
information might be lost in transmission. 

The organizational change streamlined the waiver 
decision-making process, but it did not help in 
handling the large number of Criticality 1 and 1R
items. Many of these items differ substantially from 
each other in terms of the probability of failure or 
malperformance, and in terms of the possibility 
that the worst-case effects postulated in the FMEA 
will be seen in the event the particular failure does 
occur. (In this connection it might be noted that, 
prior to 51-L, 56 Criticality 1 failures occurred on
the Orbiter during flight without any of the pos- 
tulated worst-case effects resulting.) Thus, the items 
vary considerably in their potential impact on 
Shuttle operational safety — i.e., on risk. 
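
The point that items in the same criticality category can differ widely in risk can be illustrated with a short sketch. It assumes, hypothetically, that each item has an estimated probability of failure per flight and a conditional probability that the postulated worst-case effect follows from that failure. The item names and numbers are invented for illustration and are not NASA data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalItem:
    """Hypothetical Criticality 1 item with two illustrative estimates."""
    name: str
    p_failure: float              # assumed probability of failure per flight
    p_worst_given_failure: float  # assumed P(worst-case effect | failure)

    @property
    def risk(self) -> float:
        # Probability per flight that the failure occurs AND the
        # postulated worst-case effect actually results.
        return self.p_failure * self.p_worst_given_failure


def prioritize(items):
    """Order items from highest to lowest estimated risk."""
    return sorted(items, key=lambda item: item.risk, reverse=True)


items = [
    CriticalItem("item A", p_failure=1e-2, p_worst_given_failure=1e-3),
    CriticalItem("item B", p_failure=1e-4, p_worst_given_failure=0.5),
    CriticalItem("item C", p_failure=1e-5, p_worst_given_failure=1e-2),
]
ranked = prioritize(items)
for item in ranked:
    print(f"{item.name}: estimated risk per flight {item.risk:.1e}")
```

In this invented example the item that fails most often (item A) is not the one that ranks highest; item B does, because its worst-case effect is far more likely once the failure occurs. All three would carry the same criticality classification.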

Early in its audit the Committee began urging 
NASA to find a way to prioritize the Criticality 1 
and 1R items (see Appendix C, first interim report). 
NASA managers tended to assert that, since all 
Criticality 1 and 1R items are (by definition) equally
catastrophic in their consequences, all should be 
treated equally — and, indeed, we saw evidence in 
our audits that they were handled with equal 
attention. But it is the position of the Committee 
that giving equal management attention to all such 
items could be detrimental to safety if (as is the 
case) some are extremely unlikely to fail, or the 
probability is very low that the postulated worst- 
case consequences of the failures will result. The 
most likely and most threatening failure modes 
merit the most attention. It is illogical to dissociate 
the probability of an event or its consequences 
from decisions about the management of risk. 

For example, in the development of a probabil- 
istic risk assessment for a modern nuclear power 
plant, fault tree and event tree analyses typically 
identify several million potential sequences of events 
(including multiple independent failures and cas- 
cading failures) that can lead to core melt-down. 
However, only 20 to 50 of these sequences contribute
significantly to the risk, with five to ten of
them contributing 90% of the risk. These particular 
sequences are exhaustively analyzed to identify 
ways to substantially reduce the overall risk. 
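
The selection of dominant sequences described above can be sketched as a cumulative ranking. The function below is a minimal illustration, not the method used in any particular risk assessment; the sequence labels and risk values are hypothetical.

```python
def dominant_contributors(risks, threshold=0.90):
    """Return the top-ranked sequences whose combined share of the
    total risk first reaches the given threshold.

    risks: mapping of sequence label -> estimated risk contribution
           (any consistent unit; only the relative shares matter).
    """
    total = sum(risks.values())
    if total <= 0:
        raise ValueError("total risk must be positive")

    selected = []
    running = 0.0
    # Walk the sequences from largest to smallest contribution.
    for label, value in sorted(risks.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(label)
        running += value
        if running / total >= threshold:
            break
    return selected


# Hypothetical contributions, chosen so that they sum to 100.
example_risks = {
    "seq-1": 50.0,
    "seq-2": 30.0,
    "seq-3": 12.0,
    "seq-4": 5.0,
    "seq-5": 3.0,
}
top = dominant_contributors(example_risks)
```

Here three of the five hypothetical sequences account for 92 percent of the total, so they are the ones that would be analyzed exhaustively.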

A secondary consideration of the Committee was 
the possible impact of the disclosure that, as the 
resumption of Shuttle operations nears, there are 
more Criticality 1 and 1R items (with all of them 
being waived) than there were before the accident. 
That perception would not be justified by, and 
would not fairly reflect, the real strides in system 
safety that have been made since 51-L.

Responding to suggestions on the part of the 
Committee, NASA developed and tested a number 
of techniques that could be used to prioritize the
CIL on the basis of the relative risk each item
represents. One such scheme — termed the Critical 
Item Risk Assessment (CIRA) procedure — was se- 
lected and instructions for its implementation have 
now been promulgated throughout the NSTS pro- 
gram (NSTS 22491, June 19, 1987). 

The CIRA procedure is currently qualitative in 
nature — although it employs reliability and test 
data to some extent. It is based instead on judg- 
ments about the degree of threat inherent in dif- 
ferent risk factors. The Committee is concerned 
about the potential negative impact on the CIRA 
of ambiguous measures of risk and probability. 
However, the technique does lend itself to the
incorporation of more rigorous quantitative meas- 
ures of risk and probability of occurrence as these 
measures are developed for use within NASA. (See 
Appendix E for a discussion of CIRA and one 
approach to quantitative measures suggested by 
the Committee.) 

Current plans for the implementation of CIRA, 
spelled out by the NSTS Deputy Director (Program) 
in a memorandum dated July 21, 1987, are for 
STS project managers to prioritize the Criticality 
1, 1R, and 1S items in each project after completing
the FMEA/CIL reevaluation and presenting the CIL 
at the Level III CCB. By two weeks before Design 
Certification Review, each project manager will 
provide the NSTS Deputy Director (Program) with 
a list of “the 20 items in his project that represent
the greatest risk to the program.” The Deputy
Director will then compile and distribute a report. 
This assessment effort will run parallel to, and may 
not actually affect, the preparations for STS-26 
(the next scheduled Shuttle flight). However, “an 
alternate course of action” may be chosen for 
subsequent missions. The Committee views this 
implementation procedure with concern. It does 
not appear to reflect a serious concern on the part 
of the NSTS Program for the need to prioritize the 
CIL by assessing relative risks. 

Recommendations (2): 

The Committee recommends that the formal
criteria for approving waivers include the probability
of occurrence and probability that the worst-case
failures will result. We further recommend
that NASA establish priorities now among Criticality
1 and 1R items, taking care not to use
ambiguous measures of risk and probability. NASA
should also modify the definitions of criticality in
terms of the probability of failure and probability
of worst-case effects. Finally, we recommend that
NASA Level I management pay special attention
to those items identified as being of highest priority,
along with the rationale that produced the priority
rating. Responsibility for attending to lower-priority
items within the present Criticality 1 and 1R
categories, when reclassified, should be distributed
to Levels II and III for detailed evaluation and
decision.

5.3 HAZARD ANALYSIS AND MISSION
SAFETY ASSESSMENT 


NASA hazard analyses currently do not 
address the relative probabilities of a particular 
hazardous condition arising from failure modes, 
human errors, or external situations. 

The hazard analysis and the mission safety 
assessment do not: address the relative prob- 
abilities of the various consequences which 
may result from hazardous conditions; provide 
an independent evaluation of the retention 
rationales stated in the input CILs; or provide 
an overall risk assessment on which to base 
the acceptance and control of residual hazards. 


Hazard analysis (HA) is intended to be a key 
part of NASA's safety and risk management proc- 
ess. Because it considers hazardous conditions, 
whatever their source, it is a top-down analysis 
that should encompass the FMEA and other bot- 
tom-up analyses and cover the safety gaps that 
these other analyses might leave. In reality, how- 
ever, the HA has not played the central role it was 
designed to play. Instead, the main focus has been 
on the FMEA and its corresponding CIL retention 
rationale. These are design-based analyses, pre- 
pared by the project engineering staff. (See Section 
5.1.)

The Committee’s audit of the FMEA/CIL re- 
evaluation and hazard analysis review produced, 
at first, a somewhat confusing and contradictory 
set of perceptions about the relationships between 
these safety analyses and the nature of the overall 
risk assessment and management process of which 
they are a part. Gradually, it became clear that
there were differences between the officially prescribed
process and the real process, as well as
differences in the way the process is perceived by
various NASA personnel, depending on their function
and point of view. Beyond that, there were
also differences among the NASA centers in the 
implementation at the detail level. 

Figure 5-1 (shown earlier), which was prepared 
by the Safety Division at JSC, depicts fairly accu- 
rately the process, as the Committee has come to 
understand it, that was prescribed by NASA policy 
at the time of the Challenger accident. Here, the 
HA is clearly an important element, buttressed by 
a number of complementary analyses including the 
FMEA/CIL. The ultimate product of the safety 
analysis is the Mission Safety Assessment (MSA), 
feeding into the deliberations of the various engi- 
neering and readiness review boards. Figure 5-3, 
also prepared by the Safety Division at JSC, shows 
the process from the perspective of that Division, 
focusing on the HA as the central activity. Note 
that the FMEA/CIL is listed as one of many inputs
to the hazard analysis. The actual process appears 
to be quite different from the one suggested by the 
preceding two figures. 

During the latter part of 1986 and the first few 
months of 1987, our audit led to the impression 
that, although some of the FMEA/CILs were inputs 
into the HA function, the real risk acceptance 
process within NASA operated essentially as shown 
in Figure 5-4 (obtained from JSC). One can see 
from the diagram that the “Hazard Analysis As 
Required” is a dead-end box, with inputs but no 
output with respect to waiver approval decisions. 
Our impression was supported by subsystem proj- 
ect managers, engineers and their functional man- 
agement at JSC. Many of them believed that the 
CIL path shown in Figure 5-4 was the actual 
approval route for retention of designs with Crit- 
icality 1 and 1R failure modes. 

A key problem, in our view, is that the risk 
assessment shown in the box entitled “Retention 
Rationale and Risk Assessment” was not really an 
independent assessment of the risk levels by profes- 
sional system safety engineers; such individuals 
(and they are few in number within NASA) were 
“left out of the loop.” Neither did the assessment 
contain an evaluation of how system hazards re- 
sulting from critical item failure modes would be 
controlled. In practice, in most cases reviewed by 
the Committee, the retention rationales written on 
the CIL forms were simply transferred to the hazard 
analysis reports and became the basis for final 
acceptance of residual hazards, and for decision- 
making at Flight Readiness Reviews (FRRs). 
FIGURE 5-4 The STS FMEA/CIL closed loop process provides for feedback on actions: the actual process
(after Rockwell STS Div.).









NASA does not use the HAs and (in turn) the 
MSAs as the basis for the Criticality 1 and 1R 
waivers. In fact, HAs for some important subsys- 
tems were not updated for years at a time even 
though design changes had occurred or dangerous 
failures were experienced in subsystem hardware. 
(An example is the 17-inch disconnect valves
between the ET and Orbiter.) The Committee’s
audit showed that standards and detailed instructions
for the conduct of HAs were not found to be
consistent throughout the STS program; NSTS 
22254 was issued to correct that problem. 

In summary, the Committee found in its review 
of the HA process that: 

1. HAs were done for only the largest subsystems
of the STS; they addressed certain overlays 
of hazards but were not traceable to all 
failures in units within the subsystems. 

2. HAs were not done routinely for each major 
subsystem. 

3. The HA assumed worst-case consequences 
and simply categorized hazard levels (cata- 
strophic or critical) based on whether there 
was time for counter-actions. 

4. The HA process called for an independent 
evaluation of the HA results. Analyses of 
catastrophic and critical hazards were to be 
verified using risk assessment techniques. 
However, the HAs did not address the relative 
probability of occurrence of various failures, 
based on actual flight and test information, 
nor did they evaluate the validity of the CIL 
retention rationale against any formal set of 
criteria. 

We found that many engineering personnel, 
functional managers, and some subsystem man- 
agers were unaware of what tasks must be done 
to complete the hazard analysis, did not know 
whether they had actually been done, and did not 
contribute to them. Some, in fact, believed that 
HAs were just an exercise done by reliability and/ 
or safety people and that they were redundant to 
the FMEA/CILs. Their belief appears to be justified, 
in that these HA activities did not seem to be 
authoritatively in-line as part of a true hazard 
control and risk management process. It appears 
they were carried out in a relatively sterile envi- 
ronment outside the mainstream of engineering. 


The safety personnel did use the HAs along with 
the FMEA/CILs to create Mission Safety Assess- 
ments for the major elements of the STS and for 
the overall missions. These MSAs were to provide 
“a formal, comprehensive safety report on the final 
design of a system.” However, in practice, the MSA 
reports essentially served as process assurance re- 
ports. They listed the hazards and stated whether 
they were eliminated or controlled; compared hard- 
ware parameters with safety specifications; speci- 
fied precautions, procedures, training or other safety 
requirements; and generally documented compli- 
ance with the various reliability and safety tasks. 
They did not provide in-depth quantitative risk 
assessments, and relied almost exclusively on the 
CILs and HA reports for justification of acceptable 
risks. 

New design changes and/or flight data were 
“examined” and “judged” for safety by various 
personnel and boards at NASA Levels III, II, and 
I; the vehicles for the approval of changes appear 
to have been the FRRs and various special reviews. 
The HA and MSA reports were not viewed as 
controlling documents on a specific system config- 
uration which was judged to be safe by the safety 
organizations. The initial waivers to fly Criticality 
1 and 1R items were not always redone in a timely 
way after new data were obtained. Thus, our audit 
supports the impression that the hazard analysis is 
not used to its fullest advantage and that overall 
system safety assessments, based on test and flight 
data and on quantitative analyses, are not a part 
of the process of accepting critical failure modes 
and hazards. 

Since the Hazard Report does not provide a
comprehensive risk assessment, or even an inde- 
pendent evaluation of the retention rationale stated 
in the input CILs, we believe the overall process 
shown in Figure 5-2, representing NASA’s current 
plans, has serious shortcomings. The isolation of 
the hazard analysis within NASA’s risk assessment 
and management process to date can be seen as 
reflecting the past weakness of the entire safety 
organization. For that reason, this issue of the role 
of hazard analysis drives to the heart of our most 
sweeping conclusion, which is that the information 
flow, task descriptions, and functional responsibil- 
ities implied by Figure 5-2 must be modified if 
NASA is to achieve a truly effective risk manage- 
ment process. The reordering of functions which 
the Committee recommends is described in detail 
in Section 5.11. 
Recommendation (3):

The Committee recommends that the FMEA/CILs
be used as one of many inputs considered in
the hazard analysis and system safety assessment.
We also recommend that the overall system safety
assessment encompass a quantitative risk assessment
which in turn uses the CILs and hazard
analyses as input. Finally, the Committee recommends
that this risk assessment be the primary
basis for retention or rejection of residual hazards 
as well as critical items. 


5.4 RELATIONSHIP OF FORMAL RISK 
ASSESSMENT PROCESS TO SPACE 
TRANSPORTATION SYSTEM 
ENGINEERING CHANGES 


Elements of formal risk assessment, such as 
FMEA/CILs and hazard analyses, appear to 
have had little direct impact on the STS re- 
covery engineering process as they have not 
figured prominently in the majority of engineering
change decisions made by NASA management.


The foregoing sections have addressed the relationship
between FMEA/CIL and hazard analysis,
and their relationship to the CIL retention rationale 
review and waiver decision-making process. It is 
important also to take a broader perspective and 
examine the relationship of the risk assessment 
process, as a whole, to the actual STS engineering 
redesign activity and recovery process. 

Shortly after the Challenger accident, groups 
representing various parts of NASA (design centers, 
Astronaut Office, etc.) presented the NSTS Program 
Manager at JSC with their lists of items deemed to 
require attention. All were Criticality 1 or 1R
items. From these lists, the JSC Level II Program 
Requirements Control Board selected 90 (consist- 
ing of hardware, software, and procedures) to 
undergo redesign, test, or analysis before the next 
flight of the Shuttle. 

These decisions were made without formal ref- 
erence to the FMEA. Since that time, the number 
of mandatory next-flight changes across the STS 
system has grown to 159. Of these, only a handful 
have the FMEA/CIL/retention rationale (or the
hazard analysis) listed as the original source of the
change (e.g., 1 out of 23 on the SSME, 4 out of
48 on the Orbiter). Only a few of the mandatory 
changes have arisen out of the current FMEA/CIL 
reevaluation. Indeed, the redesign activity has, for 
the most part, preceded these reevaluations. Most 
of the mandatory changes were longstanding con- 
cerns, identified before the 51-L accident, which 
were derived from flight experience, engineering 
analysis, etc. 

NASA and contractor personnel told the Com- 
mittee that the stand-down provided an opportu- 
nity to address known hazards — things that were 
already “in the mill” before the accident. Thus, the 
FMEA/CIL and hazard analyses seem not to have 
affected STS engineering very significantly. Yet the 
FMEA/CIL reevaluation and the hazard analyses 
were the heart of the mandate the Committee (via 
NASA) received from the Rogers Commission in 
its recommendation III (see Appendix B). 

For this reason, the Committee was concerned
as it gained an increasing impression that the
FMEA/CIL and hazard analyses are fairly narrow
parts of the overall STS risk management/reliability
picture. The special System Design Review Boards
established in March 1986 to review design changes
slated for completion before the next flight appar- 
ently did not take the FMEA/CILs formally into 
account. As discussed in Section 5.3, the hazard 
analyses in actual practice appear to have little or 
no influence on the waiver decisions to accept 
Criticality 1 and 1R designs for flight. Also, the 
original scheduling of the first flight some six 
months after completion of the FMEA/CIL and 
hazard analysis reevaluations seemed to presuppose 
that no substantial design change requirements 
would result from the process. 

NASA and contractor personnel explained to the 
Committee that the FMEA/CIL is primarily a design 
tool, used as an input to Preliminary Design Review
in the early days of the Shuttle program. In their 
view, the current reevaluation is essentially a design 
validation effort; thus, they say, the fact that it has 
disclosed few new critical items confirms the strength 
of the original design. Furthermore, they assured 
the Committee, engineering changes are processed 
through the same configuration control boards that 
review the FMEA/CIL, and the total process is not 
complete until the last change to be implemented 
before flight has undergone a FMEA and been 
dispositioned by the board. 

The Committee accepts this explanation. How- 
ever, accepting it forces us to conclude that NASA 
may have overemphasized the importance of the
FMEA/CIL reevaluation while simultaneously not 
giving sufficient attention to its results. Also of 
concern is the Committee’s continuing impression 
that the extensive FMEA/CIL effort has focused 
on a “moving target,” as the redesign work goes 
forward without adequate feedback into that proc- 
ess. For example, the contractor conducting an 
independent FMEA on the Orbiter (McDonnell 
Douglas) reported — and JSC confirmed — that per- 
sonnel conducting the FMEAs have had to utilize 
old “as-built” hardware drawings as a data base, 
telephoning engineers whenever they believe an 
item might have been modified since the original 
design. 

In its first interim report to NASA (see Appendix 
C), the Committee recommended that NASA take 
steps to ensure a close linking between the STS 
engineering change activities and the FMEA/CIL- 
hazard analysis processes. A subsequent revision 
in the change review procedure appears to be 
helping in that regard. It requires an assessment of 
each proposed design change to determine if any 
Criticality 1 or 2 hardware is affected. Furthermore, 
NASA’s Administrator has assured the Committee 
that flight schedule considerations will not be 
allowed to reduce the rigor with which reviews 
and analyses are conducted. The Committee is 
substantially reassured regarding the strengthened 
relationship between the risk assessment process 
and STS engineering changes. However, concerns 
remain regarding the long-term outlook for a strong 
connection between these activities, as Shuttle op- 
erations resume and engineering improvements 
continue. 

Recommendation (4): 

The Committee recommends that NASA take 
firm steps to ensure a continuing and iterative 
linkage between the formal risk assessment process 
(e.g., FMEA/CIL and HA) and the STS engineering
change activities. 

5.5 TIMELY FEEDBACK OF DATA INTO 
THE RISK ASSESSMENT AND 
MANAGEMENT PROCESSES 


The Committee has found many indications 
that data from STS inspection, test and repair,
and inflight operations do not always feed
back rapidly enough or effectively enough into 
the risk assessment and management proc- 
esses. 


One of the key failures that led to the Challenger 
disaster was that data regarding O-ring erosion in 
earlier flights had not surfaced with enough visi- 
bility or in a timely enough fashion to impact the 
O-ring CIL retention rationale or the Flight Read- 
iness Review for that ill-fated mission. The Com- 
mittee has found numerous indications that data 
from STS inspection, test and repair, and inflight 
operations do not always feed back rapidly enough 
or effectively enough into the risk management 
process. For example, with a high Shuttle flight 
rate (such as the rate of one per month being 
experienced just prior to 51-L), there may be a lag 
of two or more flights before in-flight anomalies
are reviewed by the responsible NASA managers. 
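
The lag described above follows from simple arithmetic on the flight interval and the time needed to review an anomaly. The sketch below is illustrative only; the 75-day review delay is a hypothetical figure, not one reported to the Committee.

```python
def flights_before_review(review_delay_days: int, days_between_flights: int) -> int:
    """Count the flights launched after an anomalous flight and before
    the review of that anomaly is complete, assuming evenly spaced launches.
    """
    if days_between_flights <= 0:
        raise ValueError("days_between_flights must be positive")
    if review_delay_days < 0:
        raise ValueError("review_delay_days must not be negative")
    # Each full flight interval that elapses during the review adds one launch.
    return review_delay_days // days_between_flights


# Roughly one flight per month, with an assumed 75-day review delay.
lag = flights_before_review(review_delay_days=75, days_between_flights=30)
```

With these assumed values, two further missions would be launched before managers had reviewed the anomaly; halving the flight rate or the review delay shortens the lag accordingly.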

A primary issue here is the feedback of opera- 
tional experience, inspection, test and repair re- 
ports, data and anomalies into the FMEA and the 
CIL retention rationale, and their impact on waiver 
and commit-to-launch decisions. Information that 
could affect the CIL waiver retention rationale 
often appears in other parts of the system long 
before it finds its way into the rationale for reten- 
tion. For example, the SSME prime contractor has 
set up a board (Rocketdyne’s Engineering Review 
Board) to disposition every item identified as trou- 
blesome by the project engineers. However, the 
relevant CIL number and document is identified 
only after disposition is made. Similarly, the effects 
of activities such as inspection, test and repair, and 
inflight operations appear not to be adequately 
accounted for in hazard analyses. 

Furthermore, it is not clear to the Committee 
what processes exist for methodically incorporating 
operational experience into performance analysis 
programs and the system change process, or into 
the FMEA/CIL. Mission Operations Directorate 
(MOD) personnel at JSC have been heavily involved 
in the FMEA/CIL and hazard analysis reevalua- 
tions, and 14 astronauts have been assigned to 
safety functions such as FMEA/CIL. This involve- 
ment in reviews leads to the development of flight 
rules, which, as one astronaut noted, is an effort 
to address a problem through procedural changes 
when it is too late for design changes. However, 
flight rules and procedures development often do 
lead to system design changes. (The Director of 
MOD described 28 such changes made during
1985 and 1986.) 

Another critical problem is the need to provide 
rapid feedback of information on anomalies de- 
tected during inspections, tests, and repairs as well 
as those occurring in flight, into the Flight Readiness 
Review (FRR) and the commit-to-launch decision. 
For example, in the past, information from the 
previous STS flight was not available in time to 
influence the decision to launch the next mission. 

There is a well-established process for handling 
and reporting in-flight anomalies. Once detected, 
an anomaly is evaluated and tracked by a Mission 
Evaluation Team (MET) (or the equivalent). A 
Problem Report (PR) is prepared on each anomaly 
which includes data and analysis regarding the 
fault isolation and its possible resolution, and 
potential effects on future flights and schedules. 
The PR is then reviewed, evaluated, and approved 
by the relevant project organizations, SR&QA, and 
the NSTS Deputy Director (Program). The PRs and
the status of their resolution are tracked in the 
Problem Reporting and Corrective Action (PRACA) 
System. Finally, all reported anomalies and other 
concerns are compiled into a list which is made 
available to the FRR Board for the next scheduled 
flight. 


The problem has been the delays in the feedback 
from anomaly detection on one flight to the FRR 
for the next flight. NASA has a “quick look” 
procedure for expediting the reportage of signifi- 
cant anomalies up the management chain, but some 
data will simply entail an irreducible lag. NASA 
intends, for the initial flights of the Shuttle after 
its resumption, to reduce all the data from each 
flight before launching the next one. However, 
after the first few flights, NASA plans to increase 
the flight rate to a point where the data stream 
from postflight activities will once again lag. Al- 
though vigilance will certainly remain higher for 
some time in the wake of the Challenger accident, 
the Committee is nonetheless concerned that the 
same dangerous preconditions will once again be 
present. 

NASA is now establishing a new closed-loop
accounting and review system known as the System
Integrity Assurance Program (SIAP) (see Figure
5-5). Among other things, this system will tie all
Criticality 1, 1R, and 1S items (defined in Section
3.4.1 and Table 3-1) to findings in the field. A key
feature of SIAP is its Program Compliance Assur- 
ance Status System (PCASS). This is essentially a 
computer-based information system for the SIAP. 
Still being developed, the PCASS will function as 
a central data base that integrates a number of
existing information systems and sources across 
the NSTS (Figure 5-6). For example, the PRACA 
system mentioned above will be a part of it, 
speeding the transmission of data on flight anom- 
alies. 

The PCASS has the potential to provide in near 
real-time, to decision makers such as the parti- 
cipants in the FRRs, an integrated view of the 
status of problems with the STS, including trends, 
anomalies and deviations, and closure information. 
However, the PCASS will be ineffective unless 
inspection, repair, test, flight, and other data are 
fed into the system in a timely manner, and the 
data are available promptly in convenient, usable 
form. For example, delays in reporting on anom- 
alies and trends from previous flights can jeopardize 
proper decisions to launch the next flight. 

The Committee believes that the SIAP, including 
the PCASS as an integrated data base, can and 
should become a central element of STS risk as- 
sessment and management. However, great care 
must be taken to assure that the data base is 
correctly and adequately maintained. 

Essential to the successful assessment and man- 
agement of risk is the certain and timely feedback 
of preflight, flight, and postflight system perform- 
ance data; along with inspection, test and repair 
data; test results; and failure or degradation re- 
ports. Thus, a prime need recognized by NASA 
managers is to ensure that all problem actions are 
promptly placed in the PRACA/PCASS system. In 
many cases this involves a strong reliance on the
thoroughness of maintenance and handler personnel
as well as project engineers. The paperwork
burden on NASA technical and safety personnel is 
already enormous. But the timely and diligent 
reporting and the proper evaluation of such data 
are among the most important tasks they can 
perform. It is precisely where the system broke 
down in the months preceding 51-L. 

Recommendations (5): 

The Committee recommends that high-level NASA 
management attention and priority be given to 
increasing the efficiency of the flow, analysis, and 
use of inspection, test and repair, test results, and 
in-flight operations data throughout the decision- 
making process. The Committee also recommends 
that full implementation of the System Integrity 
Assurance Program (SIAP), including its Program 
Compliance Assurance Status System (PCASS), be 
given a high priority. Diverse professionals (e.g., 
design and development engineers, operating per- 
sonnel, statistical analysts) should be used in the 
development of this program, with maximum pos- 
sible early involvement by potential users and key 
decision makers. The Committee further recom- 
mends that procedures be implemented to ensure 
that all mission anomalies detected in real time and 
from recorded events, and those detected during 
the near-term inspection of recovered hardware, 
also are fed into the formal risk assessment and 
management processes for action prior to committing
to the next flight. Finally, the Committee
recommends that all such anomalies be called to
the immediate attention of launch decision makers
who will justify in writing their decisions regarding
the disposition of the anomalies.

FIGURE 5-6 Data base elements of the NASA NSTS Program Compliance Assurance and Status System
(NASA).


5.6 THE NEED FOR QUANTITATIVE 
MEASURES OF RISK 


Quantitative assessment methods, such as 
probabilistic risk assessment, have not been 
used to directly support NASA decision mak- 
ing regarding the STS, although quantitative 
analyses and test data often are used in arriving 
at qualitative subjective judgments in reaching 
decisions. Powerful methods of statistical in- 
ference are now available which allow the 
integration of all sources of information on 
risk, including data on partial degradations 
and failures as well as engineering models of 
failure modes. 

NASA is not adequately staffed with spe- 
cialists and engineers trained in the statistical 
sciences to aid in the transformation of com- 
plex data into information useful to decision 
makers, and for use in setting standards and 
goals. 


The key technical decision makers in NASA 
operate as chairmen of bodies that review relevant 
technical information. Their decisions involve re- 
quirements, design, waivers, launch decisions, etc. 
Much of this information is in the form of complex 
engineering data. Data are routinely collected from 
flight and ground tests, part changeout and failure 
histories, anomaly reports, computer simulations, 
and other sources. Some of these data are used in 
various ways for design qualification, system cer- 
tification, and configuration control. They are also 
used to establish or verify redlines and safety 
margins. They are sometimes employed in the 
FMEA to support rationales for retention, and in 
the hazard analyses to support classification of a 
hazard. They may come into play in the waiver 
process and the Flight Readiness Reviews. In other 
words, numbers and statistics appear throughout 
the risk management process, but they are generally 
used as raw data, and in a qualitative way. Nu- 
merical data have not normally been used directly 
to generate indicators of risk or reliability. Even
trend analysis, a relatively simple statistical technique
for anticipating failures, has not been employed
routinely or to maximum effectiveness.
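
To make the idea of trend analysis concrete, the following is a minimal Python sketch, not a NASA procedure. It fits a least-squares line to a measurement recorded once per flight and estimates the flight at which the fitted line would reach a redline. The readings, the redline value, the assumption that the redline is an upper limit, and the function names are all hypothetical and chosen only for illustration.

```python
from math import ceil


def linear_trend(values):
    """Least-squares slope and intercept of values against flight index 0..n-1.

    Requires at least two readings.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def first_flight_at_redline(values, redline):
    """Index of the first flight at which the fitted line reaches an upper
    redline, or None if the trend is flat or moving away from the redline."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    return max(0, ceil((redline - intercept) / slope))


# Hypothetical readings from five successive flights (illustration only).
readings = [10.0, 10.4, 11.1, 11.5, 12.0]
slope, intercept = linear_trend(readings)
crossing = first_flight_at_redline(readings, redline=14.0)
print(f"slope per flight: {slope:.2f}, projected redline flight index: {crossing}")
```

A straight-line extrapolation of this kind is only a screening device. It flags items whose margins are eroding so that they can be brought to the attention of the responsible engineers before the redline is reached.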

The Committee was informed by a number of 
NASA persons during discussions that early in the 
history of the Apollo program a decision was made 
not to use numerical probability analyses in NASA’s 
decision-making process. This disinclination still 
prevails today. As a result, NASA has not had the 
benefit of more modern and powerful analytical 
assessment tools that have been developed in recent 
years, and that are used by other high technology 
organizations, such as in the communications and 
nuclear power industries. Without such tools, it 
would be very difficult at best for safety engineers 
to transform the massive data base which has 
developed in the STS program into specific infor- 
mation regarding what was truly known and what 
was not known. In addition, the failure to use 
numerical probability analyses had the unfortunate 
effect of denying NASA designers the required 
statistical data base on various types of failures, 
along with the better understanding of the mech- 
anisms of failures that can be obtained from such 
data. 

Quantitative approaches to the overall analysis 
of risk in complex systems are known by various 
names, such as quantitative risk assessment and 
probabilistic risk assessment (PRA); we use the 
latter here. Using modern techniques of statistical 
inference in combination with engineering models 
of failure modes and system models, these ap- 
proaches have become sophisticated and powerful 
in recent years. They are employed by the nuclear 
power, aircraft, and communications industries, 
the military aerospace sector, and other developers 
and operators of complex systems. While these 
quantitative approaches are not a panacea, since 
not everything affecting flight safety can be rigor- 
ously quantified, they can permit more objective 
assessment of the varying types and quality of 
information and data which are available as well 
as reflect the uncertainties introduced by incomplete 
data or knowledge. 

An approach to statistical inference that is par- 
ticularly useful for assessing risk is the Bayesian 
approach (using, for example, Weibull, binomial, 
or Poisson likelihood functions). This allows the 
integration of information from a variety of sources, 
such as industrial data on components and mate- 
rials, test data, analytical engineering models, field 
data, and qualitative engineering judgment. The 
Bayesian approach (see Appendix D for more
details) produces a “State of Knowledge Curve” 
(technically a probability density) for the parameter 
of interest, such as the frequency of a Criticality 1 
failure. The curve provides an estimate of the 
frequency and measures the uncertainty in the 
estimate. If only the data from the few or zero 
observed failures during flights were used, then the 
uncertainty would be too large to be useful. But 
the relevant information goes well beyond that 
scant data base. For example, it may include a 
model of the mechanism which would cause the 
failure mode. This cause model may involve loads 
and safety margins whose uncertainties have been 
well characterized by existing engineering data 
bases or carefully designed margin validation tests. 
Suppose, however, that after a complete analysis, 
the uncertainty about the frequency spans both the 
safe and unsafe regions of the frequency scale. This 
is not a sign that the analysis has failed, but it is 
an indicator that more (carefully designed) tests 
are needed. The experience and intelligence of the 
subject matter experts have already been fully reflected
in the Bayesian analysis, so it is inappropriate
to ask them now to resolve the uncomfortable
uncertainty. Only new information will do. If the 
State of Knowledge Curve spans primarily the 
unsafe region of the frequency scale, then a design 
or procedure change is required. But if the safe 
region of the frequency scale carries all the uncer- 
tainty, then the uncertainty itself is of little con- 
sequence because the risk is now low enough to 
fly. 
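
The reasoning in the preceding paragraph can be sketched in a few lines of Python. This is an illustrative sketch only, not the method of Appendix D. It assumes a binomial likelihood with a Beta prior (one of several choices the text allows), approximates the State of Knowledge Curve on a grid, and then applies the three-way reading described above. The prior parameters, the safe limit, the 95 percent cut-off used to decide that the curve lies "primarily" in one region, and the flight counts are all hypothetical.

```python
from math import exp, lgamma, log


def state_of_knowledge(failures, flights, prior_a=1.0, prior_b=1.0, grid=10_000):
    """Posterior density of the per-flight failure frequency p on a grid.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood, so the
    posterior is Beta(prior_a + failures, prior_b + flights - failures).
    """
    a = prior_a + failures
    b = prior_b + (flights - failures)
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    step = 1.0 / grid
    # Midpoints of each grid cell avoid evaluating log(0) at p = 0 or p = 1.
    ps = [(i + 0.5) * step for i in range(grid)]
    dens = [exp(log_norm + (a - 1) * log(p) + (b - 1) * log(1 - p)) for p in ps]
    return ps, dens, step


def probability_below(ps, dens, step, limit):
    """Posterior probability that the failure frequency is below `limit`."""
    return sum(d * step for p, d in zip(ps, dens) if p < limit)


def classify(ps, dens, step, safe_limit, decisive=0.95):
    """Three-way reading of the curve; `decisive` is a hypothetical cut-off."""
    mass_safe = probability_below(ps, dens, step, safe_limit)
    if mass_safe >= decisive:
        return "fly"
    if mass_safe <= 1 - decisive:
        return "change design or procedure"
    return "gather more test data"


# Hypothetical inputs: no failures observed in 24 flights, uniform prior.
ps, dens, step = state_of_knowledge(failures=0, flights=24)
print(classify(ps, dens, step, safe_limit=0.01))
```

With flight data alone the curve straddles the hypothetical safe limit, which is the situation the text describes: the flight record by itself is too scant, and the prior should instead carry the engineering models and margin test data.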

Probabilistic risk assessment identifies all possi- 
ble failure scenarios along with their probabilities 
of occurrence and their consequences. The methods 
used in PRA to identify and organize these scenarios 
into a structured pattern variously include the use 
of master logic diagrams, fault trees, event trees, 
and FMEAs, among others. Since NASA has a 
great deal of experience with FMEAs in the design 
process, it is logical that they be a principal input 
to the PRA. Among the pay-offs to NASA from 
using PRA is that literally thousands of scenarios 
and their associated risks can be eliminated from 
further consideration in the hazard analysis and 
other risk assessment processes, if their contribu- 
tions to total risk and/or their probability of oc- 
currence are extremely low. (The specific limits 
should be set by the top management of NASA. 
However, failure scenarios that contribute less than 
0.01 percent of the total risk or have a probability
of occurrence of less than 10 per flight would
appear to be reasonable candidates for removal 
from further consideration.) Thus the proper use 
of PRA methods could significantly reduce the time 
and effort expended on risk assessment activities 
while, at the same time, identifying in a quantitative 
manner the most important contributors to overall 
risk. By concentrating on these priority items, 
NASA can reduce the overall risk and perhaps the 
total cost of risk assessment. 
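
The screening step described above can be sketched as follows. This is a hedged illustration, not a NASA tool. The Scenario class, the scenario names, and the probabilities and consequence weights are hypothetical. The 0.01 percent risk-share limit comes from the text. The probability floor is left as an optional parameter with no default, since the text leaves the specific limits to NASA's top management.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: float  # per-flight probability of occurrence
    consequence: float  # severity weight in hypothetical units

    @property
    def risk(self) -> float:
        return self.probability * self.consequence


def screen(scenarios, min_risk_share=1e-4, min_probability=None):
    """Split scenarios into (retained, screened_out).

    A scenario is screened out when its share of total risk is below
    min_risk_share (1e-4 is 0.01 percent) or, if min_probability is given,
    when its probability is below that floor. Retained scenarios are
    returned in order of decreasing risk.
    """
    total = sum(s.risk for s in scenarios)
    retained, screened_out = [], []
    for s in scenarios:
        low_share = total > 0 and s.risk / total < min_risk_share
        low_prob = min_probability is not None and s.probability < min_probability
        (screened_out if low_share or low_prob else retained).append(s)
    retained.sort(key=lambda s: s.risk, reverse=True)
    return retained, screened_out


# Hypothetical scenarios for illustration only.
scenarios = [
    Scenario("A", 1e-3, 100.0),
    Scenario("B", 1e-4, 100.0),
    Scenario("C", 1e-9, 100.0),
]
retained, screened_out = screen(scenarios)
print([s.name for s in retained], [s.name for s in screened_out])
```

Sorting the retained scenarios by risk gives the ranked list of priority items on which the text suggests NASA concentrate.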

Quantitative methods of analysis rely on the 
modeling of statistical data of many kinds. For an 
example of the application of a statistical technique 
called logistic regression to reveal a statistically
significant trend and predict the probability of an 
STS event while specifying the prediction uncer- 
tainty, see Appendix E. It is essential that such 
analyses be performed with the advice of profes- 
sionals who understand the full range of analytic 
tools available through the modern statistical sci- 
ences. There currently are not enough professionals 
in the statistical/analytical sciences among NASA’s 
civil service and contractor personnel to fully ana- 
lyze such data on a regular basis. One result of 
NASA’s early decision not to use a specific relia- 
bility or risk analysis approach (apparently because 
of the lack of a large statistical data base) was that 
NASA safety organizations were not staffed with 
professional statisticians or safety-risk analysts, and 
project engineers were not trained in modern sta- 
tistical analysis techniques. 
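
For readers unfamiliar with logistic regression, the following Python sketch shows the general technique on a small synthetic data set. It is not the analysis of Appendix E and uses none of its data. The covariate values, the event indicators, and the function names are hypothetical. The fit uses Newton-Raphson on a one-covariate model and reports an approximate standard error for the slope as a simple measure of uncertainty.

```python
from math import exp, sqrt


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, iterations=50, tol=1e-10):
    """Fit P(event) = sigmoid(b0 + b1 * x) by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 is the approximate standard error
    of the slope taken from the inverse information matrix.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Entries of the 2x2 information matrix.
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    se_b1 = sqrt(h00 / det)
    return b0, b1, se_b1


# Synthetic data for illustration only: x is a stress covariate,
# y is 1 if an anomaly was observed and 0 otherwise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, se_b1 = fit_logistic(xs, ys)
print(f"slope {b1:.3f} (approx. s.e. {se_b1:.3f}), P(event | x=8) = {sigmoid(b0 + b1 * 8):.2f}")
```

With so few observations the standard error is large relative to the slope, which illustrates why the text stresses stating the prediction uncertainty alongside the prediction and involving professional statisticians in the analysis.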

Partly in response to the Committee’s interim 
reports (Appendix C), NASA has begun taking 
tentative steps toward the use of modern proba- 
bilistic analysis and other analysis techniques. A 
NASA handbook on PRA is being written. Con- 
tractor studies have been initiated to conduct trial 
PRAs of the Orbiter Auxiliary Power Unit and the 
similar Hydraulic Power Unit in the SRB, as well 
as on the Shuttle main propulsion pressurization 
system. In addition, the Jet Propulsion Laboratory 
is conducting for NASA a study of ways to improve 
the SSME certification process. They are using a 
Bayesian approach with a Weibull likelihood func- 
tion. The prior distribution is derived from engi- 
neering models of failure mode life. The idea of 
integrating engineering models with techniques of 
statistical inference is very promising. Based on the 
results of these studies, NASA plans to assess the 
benefits and applicability of PRA to the STS risk 
management process. The new Associate Admin- 
istrator for SRM&QA has indicated that he will 
personally evaluate the technique and develop and
pursue a strategy for introducing it throughout 
NASA. 

The Committee is concerned that the test with 
this very limited sample — particularly with the 
evaluation criterion stated in the NASA response 
to our first interim report (see Appendix C), namely 
comparison of the PRA results with the (current) 
“mainline FMEA/CIL activity” — could give a distorted
result and lead NASA not to introduce PRA.
We have cautioned NASA not to evaluate PRA 
merely by comparing the results of two or three 
disparate tests of PRA with the results obtained 
earlier through the FMEA/CIL process. The crite- 
rion should not only be whether a significant new 
problem is identified by the PRA. What should be 
asked is whether PRA would have helped in making 
NASA's original decisions (e.g., regarding the waiver 
on a Criticality 1 item), or would have given 
increased confidence in the decisions that were 
made. The PRA also should improve the under- 
standing of the nature of the failure modes, and 
increase the confidence in and objectivity of the 
assessment of risk. 

The judgment of experienced engineering prac- 
titioners is crucial for ensuring system safety. How- 
ever, a complex risk assessment process can actually 
obscure some of the prime contributors to risk. 
Probabilistic risk-analytic modeling techniques can 
provide decision makers with an input that clarifies 
the key choices facing them. Numbers and accompanying
analyses should not drive decisions directly,
but they can help ensure that system weaknesses
and problems “bubble up” for consideration
and decision. Also, having available a detailed 
quantitative breakdown of risk does provide ex- 
perienced decision makers with a better basis for 
intelligently managing risk. Clearly, however, the 
Committee does not wish to suggest that NASA 
subordinate sound technical judgement to numer- 
ical analysis. Such an approach would be, in our 
opinion, unrewarding and perhaps counterprod- 
uctive. 

Recommendations (6): 

The Committee recommends that probabilistic
risk assessment approaches be applied to the Shuttle
risk management program at the earliest possible
date. Data bases derived from STS failures, anomalies,
and flight and test results, and the associated
analysis techniques, should be systematically expanded
to support probabilistic risk assessment,
trend analyses, and other quantitative analyses
relating to reliability and safety. Although the
Committee believes that probabilistic risk assessment
approaches will greatly improve NASA's risk
assessment process, it recognizes that these approaches
should not be a substitute for good
engineering and quality control practices in design,
development, test, manufacturing, and operations,
all of which must continue to receive high priority
emphasis by NASA and its contractors. The Committee
further recommends that NASA build up its
capability in the statistical sciences to provide
improved analytical inputs to decision making.


5.7 THE NEED FOR INTEGRATED 
SPACE TRANSPORTATION SYSTEM 
ENGINEERING ANALYSIS IN SUPPORT 
OF RISK MANAGEMENT 


NASA safety-related analyses tend to focus 
primarily on single-event, worst-case failures 
to the relative exclusion of possible multiple 
and synergistic failures in different subsystems 
or elements of the STS. In addition, the con- 
nection between the various analyses appears 
tenuous. There does not appear to be an 
adequate integrated-system view of the entire 
STS. 


NASA's risk management process provides some 
mechanisms for identifying cross-element interface 
effects and failure modes, including propagation 
of failure modes to interfacing or physically adja- 
cent modules or subsystems. One mechanism is the 
Element Interface Functional Analysis (EIFA), de- 
scribed in Section 3.4.3. There are three EIFAs: 
Orbiter/ET, Orbiter/SSME, and Orbiter/SRB-ET (a 
fourth EIFA, on ground/flight systems, is now being 
generated). The hazard analysis is intended to be 
a top-down analysis that addresses cascading fail- 
ures. Interface Control Documents are a third 
mechanism concerned with safety at the subsystem 
interfaces. Finally, a Critical Functions Assessment 
(CFA), conducted initially in 1978 to identify 
critical functions during each mission phase, is 
currently being reevaluated by Rockwell Interna- 
tional. The CFA can include multiple and cascading 
failure combinations. 

The NSTS Engineering Integration Office at JSC
is responsible for managing system integration 
activities, the systems analysis and interface design 
effort, and analysis of integrated structural loads 
and thermal effects. As part of this responsibility, 
a series of Level II Systems Integration Review
(SIR) panels are assigned to review the FMEAs on 
both sides of an interface. The Office is supported 
by Rockwell International in the provision of Space 
Shuttle integration analyses — although Rockwell’s 
support responsibility apparently does not extend 
to some areas (e.g., on-orbit or reentry phases) or 
elements. The Engineering Integration Office, with 
the support of Rockwell, also produces Integrated 
Hazard Analyses (IHA) bridging two or more STS 
elements. 

To the extent that the hazard analysis is a top- 
down analysis, it is important that its output lead 
to the generation or modification of the FMEAs. 
But there is no indication that this is happening. 
For example, a member of the Committee audited 
the FMEA/CILs and hazard analyses related to 
potential interactions between the Orbiter fuel cells, 
water management, active thermal control, and life 
support subsystems; in particular, he looked for 
indications of possible effects of the presence of 
hydrogen in the cooling or potable water which 
would result from a failure of the hydrogen sepa- 
rator. The FMEA/CILs identified only two possible 
effects: degradation of the performance of the flash 
evaporator and a reduction of water storage ca- 
pability. Other, potentially more damaging effects 
not covered in the FMEA include: the effect of the 
possible shutdown of flash evaporators between 
140,000 and 100,000 feet on the active thermal 
control system; the violation of water quality 
standards, with resultant crew discomfort; and the 
inability to accurately assess the amount of water 
onboard. It should be noted that no hazard analysis 
seems to exist related to the potential presence of 
hydrogen in water; the Element Interface Func- 
tional Analysis is not applicable because all of the 
subsystems of concern are within the same element 
(the Orbiter). 

Although the FMEA/CIL is a bottom-up analysis, 
it should be able to expose cascading failures 
initiated by the subject failure. However, at present 
the FMEA process usually does not consider the 
cascading of failures beyond the first occurrence. 
For example, it will not consider propagation of a 
failure in the hydrogen separator into the flash 
evaporator and the subsequent propagation into
the thermal protection subsystem. The FMEA/CIL
ground rules restrict the analysis to individual 
subsystems. Contractor personnel do analyze the 
effects of a failure in the subject subsystem on 
other subsystems, but no further. 
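
The difference between a first-order view and a full cascade can be illustrated with a small graph search. The Python sketch below is an assumption-laden illustration, not a representation of any NASA analysis tool. The propagation map is hypothetical: it encodes the hydrogen separator example from the text plus one invented branch through the potable water supply. A depth limit of 1 mimics an analysis that stops at the first affected subsystem.

```python
from collections import deque

# Hypothetical propagation map: a failure in the key subsystem can
# propagate to each listed subsystem. For illustration only.
PROPAGATES_TO = {
    "hydrogen separator": ["flash evaporator", "potable water"],
    "flash evaporator": ["thermal protection subsystem"],
    "potable water": ["life support"],
    "thermal protection subsystem": [],
    "life support": [],
}


def affected(start, graph, max_depth=None):
    """Breadth-first search returning {subsystem: depth} for every subsystem
    reachable from `start`, optionally stopping at max_depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = depth + 1
                queue.append(nxt)
    del seen[start]
    return seen


first_order = affected("hydrogen separator", PROPAGATES_TO, max_depth=1)
full_cascade = affected("hydrogen separator", PROPAGATES_TO)
missed = sorted(set(full_cascade) - set(first_order))
print("not reached by a first-order analysis:", missed)
```

The subsystems in `missed` are exactly those an analysis restricted to the first occurrence would not examine, which is the gap the Committee describes.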

External failures are considered in the redun- 
dancy screen, 4 but not in the FMEA. The Com- 
mittee notes the dichotomy between the concern 
with failure of redundant items, contrasted with 
the lack of concern in the FMEA over nearly 
simultaneous failures in separate subsystems which 
could have an equally critical effect. 

The prevailing impression of the Committee is 
that, although there are several mechanisms that 
take a partial systems view, and although the level 
of effort is much greater than it was prior to 51-L, 
the various analyses do not add up to a truly 
integrated, total-systems analysis in support of risk 
assessment. Nor are they linked to the FMEA/CIL 
in such a way as to compensate for its limitations. 
The existing risk management process consists 
primarily of separate, bottom-up lines of analysis, 
without a thorough top-down, integrated systems 
analysis. 

The Associate Administrator for SRM&QA has
been directed by the Administrator to develop a 
new agency-wide risk management system that 
integrates the various parts of the risk assessment 
and management process. This is a promising 
development. It is important for NASA to call 
attention to the totality of “risk management” as
the sum of various processes, including total STS 
risk assessment, that ultimately must be considered 
on an integrated basis by line management as well 
as by SRM&QA. 

It may be noted that, of all the organizations 
and groups observed by the Committee, operations 
personnel (astronauts and flight controllers) appear 
to have the broadest and most integrated perspec- 
tive of the Shuttle system. Flight controllers in 
training have actually found real problems on 
spacecraft while performing cross-element analy- 
ses. The continuous development and updating of 
flight rules and procedures is an important source 
of this perspective. For example, the Mission Op- 
erations Directorate (MOD) flight rules sheet now 


4 The redundancy screen is a method for documenting the capabilities
for redundancy verification: A — capable of checkout during normal
ground turn-around between flights. B — loss of redundant element is
readily detectable in flight. C — there is a possible single event (e.g.,
contamination or explosion) which can cause loss of all redundancy.
lists the relevant hazards, FMEAs, and CILs in a
matrix format. An experimental system being de- 
veloped by MOD — the Shuttle Configuration 
Analysis Program (SCAP) and Failure Analysis 
Program (FAP) — is able to simulate multiple fail- 
ures and their effects. This system could be useful 
in integrated risk analysis. 

Another strong example of the integrated, sys- 
tems engineering approach is the Avionics Audit, 
a series of studies performed by Rockwell since
1979 on selected avionics hardware, software, and 
Orbiter functions. An audit looks at failures across 
the STS, including cascading failures and interac- 
tions. The output of the audit is fed back into the 
FMEA/CIL/retention rationale, hazard analysis, etc.,
to ensure that they are consistent and complete or 
that a design change is implemented, with all 
relevant documents being revised accordingly. Both 
the Avionics Audit and the Critical Functions 
Assessment are promising techniques. However, 
they are presently not scoped broadly enough, nor 
are there enough highly skilled engineers available, 
with an understanding of both the STS and the 
audit techniques, to do the job. (We understand 
that there are tentative plans to expand the Avionics 
Audit to embrace the entire STS.) 

The expansion of effort on integrated analysis is
a positive sign. However, the Committee remains 
concerned that we have not found at Level II a 
consolidated, integrated STS systems engineering 
analysis, including system safety analysis, that views 
the sum of the Shuttle elements as a single system. 
We hope that, in attempting to develop an agency- 
wide risk management system, NASA will devise 
an integrated STS system analysis and assessment 
process which is closely coupled with the FMEA/ 
CIL and other components of risk management, to 
ensure assessment of the truly critical safety items 
in the STS. This would include all combinations 
of hardware, software, and procedural failures and 
malperformances, and cascading failures. Operations
personnel should be brought heavily into play
in the development of such an integrated system 
evaluation. Finally, the safety/risk management 
process should be reviewed to identify ways to 
improve both the coordination of analysis efforts 
and the efficiency of the overall process. Care must 
be taken to assure that each part of the process is 
necessary and contributes significantly to the over- 
all STS risk management system. 


Recommendation (7): 

A “top-down” integrated system engineering
analysis, including a system safety analysis, that
views the sum of the STS elements as a single
system should be performed to help identify any
gaps that may exist among the various “bottom-up”
analyses centered at the subsystem and element
levels.


5.8 INDEPENDENCE OF THE SPACE 
TRANSPORTATION SYSTEM 
CERTIFICATION AND SOFTWARE 
VALIDATION AND VERIFICATION 
PROGRAM 


In general, hardware certification and verification,
and software validation and verification
of STS components are managed and
conducted primarily by the same organiza- 
tional elements responsible for the design and 
fabrication of the units. Thus, the independ- 
ence of the certification, validation, and veri- 
fication processes is questionable. For exam- 
ple: 

— The contractor that builds the Orbiters 
(Rockwell International, STS Division) is 
also responsible for preparing the docu- 
mentation and performing the work in- 
volved in certification, but does not answer 
to an entity independent of the NSTS 
Program with regard to the certification 
function. 

— At Marshall Space Flight Center (MSFC), 
the Engineering Directorate has the prime 
responsibility for design requirements for 
the propulsion elements of STS and also 
has responsibility for the review and ap- 
proval of their certification. The Program 
Office is responsible for the design and 
development phase as well as for perform- 
ing the certification activities. 

— At the Johnson Space Center (JSC), prime 
responsibility for design requirements, de- 
sign and development, and certification for 
the Orbiter all rest with the Program Office, 
supported by the Engineering and Opera- 
tions Directorates of the Center. 

— “Independent” validation and verification 
(IV&V) of software is carried out by the 
same contractor (IBM) that produces the
STS software, with some checks being 
made by the Johnson Space Center (JSC). 

STS certification methods and responsibilities are 
described in the Shuttle Master Verification Plan 
(NSTS-07700-10-MVP-01). This plan now is being
revised to define reverification requirements which 
must be met prior to the return to flight. Figure 
5-7 depicts the phases of the process and respon- 
sibilities for preparation, review, and approval (i.e., 
by the contractor or NASA). Figure 5-8 shows the 
time sequence for the various aspects of the certi- 
fication-verification process for a subsystem, from 
the establishment of requirements to operations. 

According to the NASA Associate Administrator 
for SRM&QA, his office is responsible for devel- 
oping certification plans, reviewing the results, and 
approving the certification of STS. However, as the 
following discussion points out, the certification 
process is actually carried out by the NASA centers 
and their contractors who are building the STS. 
Although the general approach to certification is 
the same at the three centers involved in the STS 
program (JSC, MSFC, and KSC), there are several 
differences in detail, especially with respect to the 
degree of involvement of the SR&QA organizations 
(Director, JSC SR&QA, personal correspondence). 

At MSFC, the Engineering Directorate has the 
prime responsibility for establishing design require- 
ments and also for reviewing and approving cer- 
tification. The Program Office has responsibility 
for the design and development phase as well as 
for the performance of certification activities. Under 
the cognizance of the MSFC Chief Engineer, a lead 
engineer is designated for each element (ET, SRB, 
SSME) to oversee the certification activity. The 
MSFC SR&QA office reviews and approves all 
certification and verification documentation, and 
performs an independent verification assessment to 
insure that all STS elements for which MSFC is 
responsible are properly certified and qualified for 
flight. 

For the Orbiter, the JSC Program Office subsys- 
tem managers (supported by the Engineering and 
Operations Directorates of the Center) have prime 
responsibility for design requirements, design and 
development, and also the review and approval of 
all aspects of certification of hardware. However, 
the JSC SR&QA office is responsible for assuring 
the adequacy of all flight equipment through review 
and approval of all certification requirements, plans, 
and test reports. In the case of unresolved differ- 
ences between the Orbiter Project Manager and 
the JSC Manager of SR&QA regarding a certifi- 
cation issue, the appeal route is to the Director of 
JSC. As shown in Figure 5-7, the Orbiter element 
contractor (Rockwell International, STS Division) 
is responsible for preparing the documentation and 
performing the work involved in certification. 

At KSC, the verification program used during 
the establishment of the Shuttle Launch and Land- 
ing Site (LLS) was, because of the nature of that 
facility, quite different from that used for flight 
hardware. The LLS project at KSC certified that 
critical ground systems meet design performance 
requirements. KSC SR&QA and operating person- 
nel also participate in facilities, systems, and equip- 
ment certification. 

STS Orbiter flight software is developed by IBM 
under contract to NSTS/JSC. Another group of the 
same contractor, but not reporting to the devel- 
opment manager, carries out the independent val- 
idation and verification (IV&V) of the software 
produced by the development group. NASA per- 
sonnel consider the multi-organizational, multi- 
facility participation in software testing and veri- 
fication to be a strong feature of their procedure. 
They consider that IV&V is adequately performed 
in two stages: (1) by a group in IBM separate from 
the development group, and (2) through testing in 
the Shuttle Avionics Integration Laboratory (SAIL) 
at JSC. However, the Committee noted very close 
collaboration at JSC among NASA personnel and 
support contractors involved in software develop- 
ment, with little clear differentiation of roles and 
responsibilities. While such an atmosphere pro- 
motes teamwork and cooperation, it does not tend 
to promote the maintenance of adequate checks 
and balances required for truly independent IV&V. 

The Committee agrees that the existing software 
validation and verification process is well run, with 
good quality control, and we believe it should be 
retained. Indeed, performance of STS software has 
never created a problem in STS operations. How- 
ever, the Committee questions whether independ- 
ent validation and verification by a second group 
within the development contractor is sufficiently 
independent. The degree of independence certainly 
would lead to serious questioning by outsiders if 
significant problems were to develop in the flight 
software. The Committee further believes that the 
SAIL, while it may be a good end-to-end test, is 
not adequate to fulfill the purposes of IV&V. Also, 
members of the Committee were told by JSC 
representatives that, because of limited staff, the 
JSC SR&QA organization now provides little in- 
dependent review and oversight of the software 
activities in the NSTS program. 

Based on the Committee’s review of STS certi- 
fication-validation-verification processes, it appears 
that the work is managed and conducted primarily 
by the same organizational elements responsible 
for the design and fabrication of the STS units. 
The SR&QA organizations seem to have a second- 
ary role. Thus, the degree of independence of the 
SR&QA hierarchy in the certification process is 
questionable. This situation is in stark contrast to 
that which prevails for military aircraft, in which 
a totally separate organization is responsible for 
both certification and software IV&V. It also is in 
contrast with the process prevailing in the com- 
mercial aircraft industry, where the Federal Avia- 
tion Administration is responsible for certification. 
The FAA uses “Designated Engineering Represen- 
tatives” (DERs) who are employed by the airframe 
manufacturer but are responsible to the FAA while 
serving as DERs. This approach provides for in- 
dependence of the certification process from the 
design, development and production of the air- 
planes, while bringing to bear the experience of 
hands-on engineering practitioners. 

FIGURE 5-7 Phases of the STS certification process and associated organizational responsibilities (NASA). Legend: O = Originate, R = Review, A = Approve. Column headings: Responsibility, Documentation, Contents. 

[Figure phase labels: Requirements, Design, Fabrication/Acceptance, Flight Test, Operations.] 

Recommendation (8): 

Responsibility for approval of hardware certifi- 
cation and software IV&V should be vested in 
entities separate from the NSTS Program structure 
and the centers directly involved in STS develop- 
ment and operation. However, these organizations 
should continue to conduct activities supporting 
certification and IV&V. 


5.9 OPERATIONAL ISSUES 

Operational aspects of the NSTS program require 
considerable attention in risk assessment and man- 
agement. Three aspects are focused on here: Launch 
Commit Criteria waiver policy, human error as a 
contributor to risk, and cannibalization of spare 
parts at KSC. 

5.9.1 Launch Commit Criteria Waiver Policy 


An average of two Launch Commit Criteria 
(LCCs) are waived by NASA in the course of 
each launch. The Committee questions the 
validity of an operational procedure that “in- 
stitutionalizes” waivers by routinely permit- 
ting established criteria to be violated. 


Launch Commit Criteria (LCCs) are technical 
requirements and conditions pertaining to the STS 
system, ground systems, and the physical environ- 
ment that must be met before a launch can proceed. 
NASA divides LCCs into three classes: mandatory, 
highly desirable, and desirable. However, all LCCs 
are subject to waiver based on the judgment of 
responsible NASA managers, and typically a few 
(an average of two) are waived for each launch. 

To date, no LCC waiver has ever produced a 
problem on a Shuttle mission. However, Committee 
members questioned the validity of an operational 
procedure that “institutionalizes” waivers by rou- 
tinely permitting established criteria to be violated. 
There was a general feeling that “waivable” criteria 
are not valid criteria. 

NASA officials told the Committee that an av- 
erage of 2,000 LCCs come into play on a given 
Shuttle launch, so that the number waived per 
launch is an insignificant percentage of the total. 
The great majority of these are apparently not 
critical. Furthermore, they explained, in most cases 
NASA engineers know that there is some extra 
margin of safety between the LCC and the actual 
reasonable limits of safety, because they have 
learned more about the systems involved since the 
time the LCC was established. Thus, a typical LCC 
waiver represents fine-tuning — for example, a slight 
deviation in leak rates or pressurization rates. Few 
such waivers have ever led to design changes. The 
Committee is not persuaded by these arguments. 
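
The proportion NASA cites can be made concrete with a short calculation. The sketch below uses only the two averages quoted above (about 2 waivers out of about 2,000 applicable LCCs per launch); the function name is illustrative and not drawn from any NASA tool. 

```python
def waiver_fraction(waived: int, applicable: int) -> float:
    """Return the fraction of applicable LCCs that were waived."""
    if applicable <= 0:
        raise ValueError("applicable must be positive")
    return waived / applicable


# Averages quoted to the Committee by NASA officials.
fraction = waiver_fraction(waived=2, applicable=2000)
print(f"{fraction:.1%} of applicable LCCs waived per launch")
```

The result is about one-tenth of one percent. The figure says nothing about which class of criterion was waived, which is the point on which the Committee's concern rests. 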

As a result of the 51-L accident, NASA has begun 
revising the ground rules for waivers and reassess- 
ing the LCCs across the board. A time will be 
selected (probably launch minus 5 min.) beyond 
which waiver of an LCC cannot be executed unless 
contingency procedures are prescribed in advance, 
thus forcing a launch scrub. Furthermore, each 
waiver will now trigger a formal reassessment of 
the particular LCC that was waived, perhaps re- 
sulting in a change to it. 

Although these changes in policy are appropriate, 
there are aspects of LCC policy that the changes 
do not address. The Committee is uncertain about 
what criteria are used to establish LCCs initially, 
especially in the weather and environmental area. 
For example, ice on the pad at the time of mission 
51-L was later shown by films to be a serious 
hazard; yet there was no LCC governing icing. 
Similarly, there was not an LCC on temperature 
at the SRB O-rings — only an unrealistic (as it turned 
out) LCC on ambient air temperature. The Flight 
Readiness Review Board for that mission was aware 
of SRB O-ring erosion on past flights, but did not 
recognize the effects of temperature on the O-ring. 

At the same time, there is a concern that too 
much faith may be placed in the LCCs. A possible 
case in point is the Atlas Centaur launch failure of 
March 1987, in which a decision was made to 
launch the vehicle into a storm because lightning 
strikes at the time of launch appeared to be beyond 
the 5-mile range permitted by the LCCs. The Atlas 
was destroyed by lightning shortly after launch, 
and observers (including NASA personnel) later 
said that conditions were clearly not suitable for 
launch. 10 In the view of the Committee, LCCs are 
designed to permit launch; they should not be 
allowed to force a launch. Experienced judgment 
must continue to be exercised. But it would be 
useful in this regard if LCCs were more accurate 
and more comprehensive in their definition of 


NASA: Report of the Atlas Centaur — 67/FL.TSATCOM F-6 Inves- 
rigation Board, 15 July 19X7. 
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allowable limits; in that case they would not be so 
subject to waiver. 

We note the U.S. Air Force system for indicating 
the criticality of flight equipment by a “red cross” 
(a mandatory NO-GO), “red diagonal” (system 
not fully operational, but safe to fly), and “red 
dash” (some inspection not done). A comparable 
prioritization would be appropriate for NASA’s 
LCCs. Loss of an STS may be much more costly 
in dollars and lives than loss of any USAF system, 
and any means of focusing judgment should be 
welcome. There must be room for experienced 
judgment; but there must also be inviolable rules 
that prevent errors in judgment being made under 
pressure of time on certain critical LCCs. We 
recognize the objections of launch directors to 
inviolable criteria; but in our view the best launch 
director is one who is willing to be conservative 
and to live with a conservative system. 

The Committee welcomes the present review of 
LCC waiver policy. We believe that the presence 
of the newly appointed NSTS Deputy Director 
(Operations) will also help to ensure the application 
of experienced judgment and knowledge whenever 
LCC waiver decisions are being made. 

Recommendation (9a): 

The Committee recommends that NASA estab- 
lish a list of mandatory LCCs which may NOT be 
waived by anyone. This should comprise the bulk 
of the LCCs. A limited number of criteria would 
be separately listed, for special cases, together with 
a discussion of the circumstances under which they 
may be waived and who may make the waiver 
decision. 
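
One way to picture the two lists in Recommendation (9a) is as a small data structure. The following is a minimal sketch under stated assumptions: the class names, field names, and example entries are hypothetical and do not describe any actual NASA system or criterion. 

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WaiverClass(Enum):
    MANDATORY = "mandatory"        # may not be waived by anyone
    SPECIAL_CASE = "special case"  # waivable only as documented


@dataclass(frozen=True)
class LaunchCommitCriterion:
    name: str
    waiver_class: WaiverClass
    waiver_authority: Optional[str] = None      # position allowed to waive
    waiver_circumstances: Optional[str] = None  # documented conditions


def may_waive(lcc: LaunchCommitCriterion, requested_by: str) -> bool:
    """Return True only for a special-case LCC waived by its named authority."""
    if lcc.waiver_class is WaiverClass.MANDATORY:
        return False
    return lcc.waiver_authority is not None and requested_by == lcc.waiver_authority


# Hypothetical entries, for illustration only.
criterion_a = LaunchCommitCriterion("criterion A", WaiverClass.MANDATORY)
criterion_b = LaunchCommitCriterion(
    "criterion B",
    WaiverClass.SPECIAL_CASE,
    waiver_authority="designated manager",
    waiver_circumstances="circumstances documented in advance",
)

print(may_waive(criterion_a, "designated manager"))  # False
print(may_waive(criterion_b, "designated manager"))  # True
print(may_waive(criterion_b, "someone else"))        # False
```

The design choice illustrated is that a mandatory criterion carries no waiver authority at all, so no one's judgment under time pressure can override it. 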


5.9.2 Human Factors as a Contributor to Risk 


Human factors, which are considered in 
some of the STS hazard analyses, do not appear 
to be taken into account as the cause of failure 
modes in the FMEAs. Since the FMEA is one 
of the principal safety tools used in the eval- 
uation of the STS design, the Committee be- 
lieves that the STS design process should 
explicitly consider and minimize the potential 
contribution of humans to the initiation of the 
defined failure modes. 


NASA’s risk assessment and risk management 
process for the STS focuses primarily on failure of 
hardware, and secondarily on software faults and 
errors. Human error, which can be a major con- 
tributing factor in accidents, is accorded relatively 
little attention in the present risk management 
system although it is considered in some of the 
hazard analyses. While procedural aspects of STS 
operations are regularly relied upon to justify the 
retention of critical items, human factors do not 
appear to be taken into account as a source of 
failure modes in the preparation of the FMEAs. 
Human error can affect both flight operations 
(through crew operations and flight controller pro- 
cedures) and ground operations (testing, certifica- 
tion, maintenance, assembly, etc.). Hazard analyses 
can consider human error in both types of opera- 
tions activities; but the Committee has not found 
that hazard analysis is regularly used to assess this 
element of risk. 

Procedures utilized in both ground and flight 
operations are controlled by formal Configuration 
Control Boards. Personnel are, of course, trained 
and certified for the operations that they will carry 
out. Procedures are verified by a variety of methods, 
including trainers, simulators, mockups, engineer- 
ing models, and analysis tools. 

The Committee initially had some concerns re- 
garding the lack of involvement of flight operations 
personnel in engineering redesign decisions and 
safety reviews, but through discussions with NASA 
personnel these concerns were largely resolved. 
However, we remain troubled by aspects of ground 
operations, with respect to their human error 
potential. We note that two of the three fatal 
spacecraft accidents in the U.S. manned space 
program to date occurred on the ground, of which 
one was caused by procedural errors on the part 
of the ground crew. 11 Removal and replacement of 
parts, test, repair, and all the various ground 
operations provide enormous potential for error 
that can lead to serious problems. The potential 
may be exacerbated by the fact that, at KSC, 
ground personnel are relied upon to report any 
errors they make which could induce damage; there 
is little incentive for self-reporting. 

A draft NASA Handbook on Systems Assurance, 
recently prepared by the Safety Risk Management 
Program Office of Headquarters SRM&QA Safety 
Division, places new emphasis on human error in 
risk assessment. In a proposed risk assessment 
model (Figure 5-9), sensitivity to human error is 
presented as one factor that contributes to the 
likelihood of a failure mode occurring. This is a 
positive sign, but it now is far from being imple- 
mented in the fabric of NASA system design and 
safety assurance. 

11 Two Shuttle processing workers were asphyxiated and killed in late 
1986 during a test involving nitrogen gas. (The Apollo fire in 1967 
was not caused by human error, but by a shorted wire which initiated 
a fire in the pure oxygen atmosphere.) 


Recommendation (9b): 

The Committee recommends that the NASA 
FMEA include human factors among the recog- 
nized sources of potential causes of failure modes. 
This step would provide another valid link between 
the FMEA and the hazard analysis, which are now, 
in our view, too tenuously connected. 

5.9.3 Cannibalization of Spare Parts 


By the time of the Challenger accident, 
“cannibalization,” the removal of parts at the 
Kennedy Space Center (KSC) from one oper- 
ational STS element to fulfill spares require- 
ments in another, had become a prevalent 
feature of STS logistics, thus introducing a 
variety of failure potentials associated with 
human error. Cannibalization is not evaluated 
as a producer of potential failure in either the 
hazard analysis (where it would be most ap- 
propriate) or the FMEA. 


NASA initiated a spares program in 1981, as 
Shuttle test flights began. Early flights were sup- 
ported with spare parts produced on order, a source 
of trouble since parts were often not available in 
a timely fashion. After other Shuttles came on line 
and as the flight rate increased, parts shortages 
became increasingly severe. Cannibalization was 
often the only answer to meet the flight-rate de- 
mand. 

As the President of Rockwell International STS 
Division said to the Committee, “In the last year 
of flight, cannibalization was the name of the game. 
We were robbing Peter to pay Paul all throughout 
the system.” With budgetary constraints and cost 
overruns a chronic reality, NASA apparently de- 
cided to emphasize STS fabrication and launchings 
over purchasing adequate spare units; the result 
was logistics problems. 

From a safety standpoint, cannibalization raises 
many problems. First, having workers enter one 
vehicle and remove a part presents the danger that 
they will inadvertently (and perhaps unknowingly) 
damage an adjacent part of the vehicle. Second, 
there is the risk that the part itself will be damaged 
upon removal and transport. Third, there is the 
chance that the part will be improperly replaced 
in the vehicle for which it was cannibalized as well 
as in the original vehicle when the part is returned 
or replaced. The latter two possibilities are theo- 
retically covered by post-installation checkout and 
inspection, but the risk of error increases as the 
incidence goes up. Workers are required to report 
any possible damage they cause, but the “honor 
system” may not be 100% reliable. Finally, can- 
nibalization per se is not explicitly evaluated within 
the hazard analysis process. 

Figure 5-10 shows the incidence of cannibali- 
zation over approximately the last year before the 
accident. It can be seen that at least one-third of 
the Orbiter Line Replaceable Units (LRUs) flown 
on some missions were obtained through canni- 
balization. A NASA official at KSC told the Com- 
mittee that the problem of spares had become so 
acute that, if Shuttle flights had continued uninter- 
rupted, KSC would not have been able to sustain 
STS operations. 

The flight hiatus has given NASA time to improve 
the spares inventory and to make some needed 
changes in logistics management. Responsibility 
for Orbiter logistics has been assigned to KSC. The 
spares budget has been increased. Furthermore, 
there has been a sharp drop in planned flight rate, 
which should reduce the requirement for canni- 
balization. Also, stricter management controls have 
been placed on cannibalization, making it unlikely 
that personnel will readily resort to this practice. 
The program hopes to achieve a level of support 
in which lack of spares would delay processing no 
more than 5 percent of the time (the aerospace 
industry standard). The new NSTS System Integrity 
Assurance Program specifically prohibits cannibal- 
ization except by approval of the chairman of the 
PRCB, and requires the collection and analysis of 
supportability trend data in support of logistics 
management. 

Reducing the repair time for spare parts is the 
fastest way to improve the inventory and reduce 
cannibalization. The repair processing time is cur- 
rently too long, but a gradual reduction in flow 
time is expected to occur. 

Recommendations (9c): 

The Committee recommends that NASA main- 
tain its current intense attention toward reducing 
cannibalization of parts to an acceptable level. We 
further recommend that adequate funds for the 
procurement and repair of spare parts be made 
available by NASA to ensure that cannibalization 
is a rare requirement. Finally, we recommend that 
NASA include cannibalization, with its attendant 
removal and replacement operations, as a potential 
producer of failure in the integrated risk assessment 
recommended earlier (Section 5.1). 


5.10 OTHER WEAKNESSES IN RISK 
ASSESSMENT AND MANAGEMENT 

5.10.1 The Apparent Reliance on Boards and Panels 
for Decision Making 


The multilayered system of boards and panels 
in every aspect of the STS may lead individuals 
to defer to the anonymity of the process and 
not focus closely enough on their individual 
responsibilities in the decision chain. The sheer 
number of STS-related boards and panels seems 
to produce a mindset of “collective responsi- 
bility.” 


The NSTS Program is a large organization whose 
mission involves the development, deployment, and 
operation of a complex space vehicle in a wide 
range of missions. Associated with each milestone 
in the development of any NASA space system and 
its constituent parts, or in the preparation for a 
space mission, are one or more reviews. These 
reviews may be made from the standpoint of 
requirements, engineering design, development sta- 
tus, safety, flight readiness, or resource require- 
ments. Conducting each review is a team, panel, 
or board, which may or may not be permanently 
empaneled. As described in Section 3.2.2, in the 
NSTS Program there are review groups at every 
level of management, including the contractor or- 
ganizations. 

Figure 5-11 depicts the review groups associated 
with the NSTS FMEA/CIL and hazard analysis 
processes alone. There are also boards to review 
design requirements and certification, software, the 
Operations and Maintenance Requirements and 
Specifications Document (OMRSD) and the Op- 
erations and Maintenance Instructions (OMI), the 
Launch Commit Criteria, and mission rules. There 
are flight readiness reviews at each stage of prep- 
aration, with a Launch System Evaluation Advisory 
Team to assess launch conditions and a Mission 
Management Team to oversee the actual mission. 

The Committee developed a concern about a 
possible attitudinal problem regarding the decision 
process on the part of the NASA personnel engaged 
in it. Given the pervasive reliance on teams and 
boards to consider the key questions affecting 
safety, “group democracy” can easily prevail, with 
the result that individual responsibility is diluted 
and obscured. Even though presumably the chair- 
man of each group has official responsibility for 
the decision, most decisions appear to be highly 
participatory in nature. In a CCB review audited 
by the Committee, for example, there were 25-35 
people present and the role of the chairman was 
not especially distinct. Each action appeared to be 
a consensus action by the board. 

It is possible that this is a factor in the problem 
identified by the Rogers Commission: “ . . . a NASA 
management structure that permitted internal flight 
safety problems to bypass key Shuttle managers” 
(Vol. I, p. 82). For example, the Level II PRCB 
conducts daily and weekly meetings — usually via 
teleconference — in which as many as 30 people 
participate. It is certainly conceivable that individ- 
uals might be reluctant to express their views or 
objections fully under such circumstances. Also, 
passing decisions upward through the ranks of 
review boards may reduce each chairman's sense 
that his decisions are crucial. As a case in point, it 
is clear from the report of the Rogers Commission, 
and from statements made to the Committee by 
NASA personnel involved, that the lines of au- 
thority and responsibility in the flight readiness 
review decision-making chain had become vague 
by the time of mission 51-L. 

In discussing this issue, NASA’s Associate Ad- 
ministrator for SRM&QA pointed to the SR&QA 
directors at the field centers as the individuals with 
primary responsibility for the safety of the Shuttle 
system. They are said to have full “responsibility, 
authority, and accountability.” Nevertheless, these 
individuals do make inputs to larger and higher 
boards, so that in the end all decisions become 
collective ones, lacking the crucial mindset of 
individual accountability. 

FIGURE 5-11 NASA relies on a multilayered system of panels and boards for decisions on engineering design and safety matters. 

It is possible that a semantic problem is partly 
at fault here, in that NASA managers often refer 
to “the board” as being synonymous with its 
chairman, with respect to decision authority. 
Nevertheless, a mindset is thereby established in 
which it is not clear whether these are individual 
or group decisions. 

The Committee contrasted the NSTS system with 
that of the U.S. Air Force, in which the board 
(including its chairman) makes recommendations 
to the decision maker. One positive point in favor 
of NASA’s system is that, there, the chairman (who 
is the decision maker) is required to listen “in 
public” to all dissenting views. 

The Committee recognizes the important role 
played by the many panels and boards in the NSTS 
program in providing coordination, resolving prob- 
lems and technical conflicts, and reviewing and 
recommending actions. These entities allow the 
different interests and skill groups to bring forward 
their inputs, contribute their knowledge, and thus 
minimize the risk that a proposed action will 
negatively affect some aspect of the STS. 

Recommendation (10a): 

The Committee recommends that the Adminis- 
trator of NASA periodically remind all NASA 
personnel that boards and panels are advisory in 
nature. He should specify the individuals in NASA, 
by name and position, who are responsible for 
making final decisions while considering the advice 
of each panel and board. NASA management 
should also see to it that each individual involved 
in the NSTS Program is completely aware of his/ 
her responsibilities and authority for decision mak- 
ing. 


5.10.2 Adequacy of Orbiter Structural Safety Margins 


The primary structure of the STS has been 
excluded, by definition, from the FMEA/CIL 
process, based on the belief that there is an 
adequate positive margin of safety. However, 
the Committee questions whether operating 
structural safety margins have actually been 
proven adequate. 


Completion of the Model 6.0 loads study 
and the reevaluation of margins of safety based 
on these loads will significantly improve 
NASA’s grasp of actual operating margins of 
safety. 


NASA groundrules exclude primary structure 
from the FMEA/CIL process. NASA has apparently 
assumed that the structural reliability of the STS 
(including the Orbiter, External Tank, and Solid 
Rocket Boosters) is close to 1.00, because the 
operating loads are believed to be less than the 
proof load to which the vehicle has been subjected. 
It is true that some structures have reliability 
approaching 1.00; examples include bridges, build- 
ings, and even commercial airliners. But there is a 
considerable difference between the Shuttle, a first- 
of-its-kind vehicle operated under unique condi- 
tions and challenging environments, and a com- 
mercial airliner, which is designed and tested to 
loads and conditions that are well understood. In 
addition, in the case of a commercial airliner the 
certifying agency (FAA) and operator organizations 
act as independent rule makers and auditors. No 
such independent check and balance exists for the 
STS, where NASA controls all functions in-house 
(including requirements, analysis methods, testing, 
and certification) — primarily within the NSTS pro- 
gram. 

The original development plans for the Orbiter — 
the most complex and vulnerable element, and the 
only manned element — included a conventional 
structural test program for certification of the 
structural integrity. A complete, full-scale structural 
test article (an Orbiter vehicle) was to be included 
which was to be loaded to 1.4 times the operating 
limit load in the most critical conditions. (This 
compares to the conventional value of 1.5 used by 
the military and the FAA.) Due to budget problems 
NASA decided to eliminate one of the planned 
flight vehicles and convert the static test article 
(#099, Challenger) to a flight vehicle after a series 
of proof tests to only 1.20 times the limit load. 
Some loading conditions actually did not exceed 
1.15 times the limit load. Therefore, the tests did 
not even verify a 1.4 strength margin over limit 
loads. Subsequent flight test data and calculations 
show that in some areas the maximum operating 
loads are actually 15% to 20% higher than those 
originally postulated, so that the static proof load- 
ing tests demonstrated only approximate limit 
conditions. Thus, today there is no demonstrated 
verification of safety margins for critical elements 
of the Orbiter. 
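
The arithmetic behind this conclusion can be sketched briefly. The example below assumes, for illustration only, that the proof factors (1.15 and 1.20) and the load growth (15% to 20%) quoted above apply to the same structural areas; the report does not say which areas coincide, and the function name is hypothetical. 

```python
def demonstrated_factor(proof_factor: float, load_growth: float) -> float:
    """Proof load expressed as a multiple of the revised operating load.

    proof_factor: proof load divided by the originally postulated limit load
    load_growth:  fractional increase of the actual load over the postulated one
    """
    return proof_factor / (1.0 + load_growth)


# Combinations of the figures quoted in the text.
for proof in (1.15, 1.20):
    for growth in (0.15, 0.20):
        factor = demonstrated_factor(proof, growth)
        print(f"proof {proof:.2f}, load growth {growth:.0%}: "
              f"{factor:.2f} x revised load")
```

Every combination falls between roughly 0.96 and 1.04 times the revised load, far below the planned 1.4, which is consistent with the statement that the proof tests demonstrated only approximate limit conditions. 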

The model of loads and stresses on the Orbiter 
used in its original design has been revised once. 
By 1983 even these data had become suspect, and 
another complete revision of loads using the latest 
test and analysis data was begun. Calculated strength 
margins from this study (called Model 6.0) are 
expected to be available by November 1987. 

The Committee believes that the margin of actual 
strength over maximum expected limit load for 
critical areas of the Orbiter structure is not well 
known. Partly this is because loading conditions 
are complex and unprecedented, and partly it is 
because very little (if any) of the flight structure 
was actually tested to failure. The Committee agrees 
with the decision not to use the FMEA/CIL process 
on STS structures. However, we remain concerned 
about the uncertainty in the actual strength margins 
of safety. The Model 6.0 loads calculation now 
nearing completion should correct the known dis- 
crepancies in external loads. Verification of the 
Model 6.0 loads by data routinely gathered from 
an instrumented and calibrated flight vehicle, be- 
ginning with the next flight, can help verify the 
model and establish the margins of safety more 
definitively. This knowledge will greatly improve 
NASA’s ability to keep Shuttle operations within 
a safe envelope of structural loads. 

Implicit in the safe operation of any such struc- 
ture is a monitoring system to assure that deteri- 
oration of structural integrity does not occur. An 
effort now underway could add materially to 
NASA’s ability to operate the Orbiter’s structure 
safely over its service life. People with airline 
experience, working under Rockwell International, 
are developing a maintenance and inspection plan 
for the structure. A well-planned periodic inspec- 
tion of this sort is essential, and is the best preven- 
tive for unpleasant occurrences due to structural 
deterioration or other causes. 

Recommendations (10b): 

The Committee recommends that NASA place a 
high priority on completion of the Model 6.0 loads, 
the reevaluation of safety margins for these loads, 
and the early verification and continued monitoring 
of the Model 6.0 loads by permanently instru- 
menting and calibrating at least the next full scale 
STS vehicle to fly. We further recommend that 
NASA complete and implement a comprehensive 
plan for conducting periodic inspection and main- 
tenance of the structure of the Orbiters throughout 
the service life of each vehicle. 

5.10.3 Software Issues 


NASA FMEAs do not assess software as a 
possible cause of failure modes. 

There is little involvement of JSC Safety, 
Reliability and Quality Assurance in software 
reviews, resulting in little independent quality 
assurance for software. 

A large amount of data — much of it flight 
specific — must be loaded for each Shuttle mis- 
sion but it is not subjected to validation as 
rigorous as that for the software. 


The Shuttle onboard data processing system 
consists of five general purpose computers (GPCs) 
with their input and output devices, and memory 
units. Four of the five GPCs contain the primary 
software system, known as the Primary Avionics 
System Software (PASS); the fifth is a redundant 
computer w'hich contains the Backup Flight System 
(BFS). The PASS is developed by IBM, and the BFS 
is built by Rockwell. 

In addition to flight software code, there are also 
flight software initialization data, called “I-loads”, 
which are mission-unique parameter values. The 
basic code is reconfigured for specific missions, 
with about two such “reconfigured flight loads” 
per flight. After the software requirements are 
approved, three levels of development tests are 
performed leading to the First Article Configuration 
Inspection, or FACI. At the FACI milestone, the 
software package is handed off to the contractor’s 
verification organization for independent testing, 
called Independent Validation and Verification 
(IV&V), which leads to the Configuration Inspec- 
tion (CI) and delivery to NASA. (The degree of 
independence of the IV&V was discussed in Section 
5.8.) Following mission-specific reconfiguration and 
testing in the SAIL and other JSC laboratories, the 
package is ready for Flight Readiness Review. 

A Shuttle Avionics System Control Board (SASCB) 
is the Level II flight software control board, to 
which the Program Requirements Control Board 
has delegated responsibility for software configu- 
ration control. The Manager of the NSTS Engi- 
neering Integration Office chairs this board and 
signs the flight readiness statement on software; 
thus he is the focus of configuration control and 
management authority for software. At Level III 
there is a Software Control Board, corresponding 
to the Configuration Control Board for hardware 
issues. 

The testing, control, and performance of STS 
software seem quite good. Out of some half-million 
lines of code in the Shuttle flight software, typically 
an average of one error is discovered beyond the 
CI. With the emphasis placed on early detection 
of errors, error rates are quite low throughout the 
total 10 million-line Shuttle software system. Only 
once has a software problem disrupted a mission 
(on STS-7, uncertainty about the effect of installed 
software code on a particular abort scenario caused 
a launch scrub). Both the developers and the 
“independent” certifiers perform their own inspec- 
tions of the code. Special “code audits” are also 
carried out to reinspect targeted aspects of the code 
on a one-time basis, based on criticality, complex- 
ity, Discrepancy Reports (DRs), and other consid- 
erations. Software quality control includes weekly 
tracking of DRs through the Configuration Man- 
agement database (which tracks all faults, their 
causes and effects, and their disposition); trends of 
DRs are reported quarterly. 

Although generally impressed with the Shuttle 
software development and testing process, the 
Committee made a number of specific findings. 
First, we note that software is not a FMEA/CIL 
item. NASA personnel state that all software is 
considered to be Criticality 1, with each problem 
being fixed as soon as it is detected through testing 
and simulation. The Committee believes that iden- 
tification and prediction of software faults or error 
modes may be feasible by dividing the software 
into functional modules and then considering the 
various possible failures (e.g., improper constants, 
discretes or algorithms, missing or superfluous 
symbols). 
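
As a rough illustration of the approach suggested above, the sketch below builds a blank FMEA-style worksheet by pairing each functional module with each failure category named in the text. It is not a NASA procedure; the module names are hypothetical, and the "effect" and "criticality" fields are placeholders an analyst would have to complete.

```python
from dataclasses import dataclass
from itertools import product

# Failure categories taken from the examples given in the text above.
FAILURE_CATEGORIES = (
    "improper constant",
    "improper discrete",
    "improper algorithm",
    "missing symbol",
    "superfluous symbol",
)


@dataclass(frozen=True)
class FailureModeEntry:
    module: str
    failure_mode: str
    effect: str = "TBD"        # to be filled in by the analyst
    criticality: str = "TBD"   # to be assigned during review


def build_worksheet(modules):
    """Return one blank entry per (module, failure category) pair."""
    return [FailureModeEntry(m, f) for m, f in product(modules, FAILURE_CATEGORIES)]


# Hypothetical module names, for illustration only.
modules = ["ascent guidance", "entry navigation", "redundancy management"]
worksheet = build_worksheet(modules)
for entry in worksheet[:3]:
    print(entry.module, "|", entry.failure_mode, "|", entry.effect)
```

The point of the enumeration is only that no module and failure category combination is skipped; judging the effect and criticality of each entry remains an engineering task.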

There is little involvement of the JSC SR&QA 
organization in software reviews, due to the limi- 
tations on staff. As a result, there is little inde- 
pendent quality assurance for software. 

Finally, we note that a large amount of data — 
much of it flight specific — must be loaded for each 
Shuttle mission. However, the data and its entry 
are not validated with the same rigor as in the 
IV&V of the software. 
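
One simple form such data validation could take is a check of every loaded value against a declared specification. The sketch below is an assumption-laden illustration, not a description of how I-loads are actually processed: the parameter names, ranges, and units are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSpec:
    name: str
    low: float
    high: float
    units: str


def validate_load(specs, values):
    """Return a list of problems found; an empty list means the load passed."""
    problems = []
    known = {spec.name for spec in specs}
    for spec in specs:
        if spec.name not in values:
            problems.append(f"{spec.name}: missing")
            continue
        v = values[spec.name]
        if not (spec.low <= v <= spec.high):
            problems.append(
                f"{spec.name}: {v} {spec.units} outside [{spec.low}, {spec.high}]"
            )
    # Values that no specification covers are also reported.
    for name in values:
        if name not in known:
            problems.append(f"{name}: not in specification")
    return problems


# Hypothetical specification and mission load, for illustration only.
specs = [
    ParameterSpec("launch_azimuth_deg", 0.0, 360.0, "deg"),
    ParameterSpec("liftoff_mass_kg", 1.0e6, 2.5e6, "kg"),
]
bad_load = {"launch_azimuth_deg": 400.0, "spare_flag": 1.0}
for problem in validate_load(specs, bad_load):
    print(problem)
```

A range check of this kind catches only gross entry errors; it does not replace independent verification that the values are the right ones for the mission.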

Recommendations (10c): 

The Committee recommends that NASA: explore 
the feasibility of performing FMEAs on software, 
including the efficacy of identifying and predicting 
fault and error modes; request JSC SR&QA to 
provide periodic review and oversight of software 
from a quality assurance point of view; and provide 
for validation of input data in a manner similar to 
software validation and verification. 

5.10.4 Differences in Procedures Among NASA 
Centers 


Differences in the procedures being used by 
the main NASA centers involved in the NSTS 
Program may reflect an imbalance between 
the authority of the centers and that of the 
NSTS Program Office. The Committee is con- 
cerned that such an imbalance can lead to 
serious problems in large programs where two 
or more centers have major roles in what must 
be a tightly integrated program, such as the 
NSTS and Space Station. Without strong, 
central program direction and integration, the 
success and safety of these complex programs 
can be placed in jeopardy. 


In March 1986, the NASA Associate Adminis- 
trator for Space Flight and the Manager of the 
Level II NSTS Program issued memoranda setting 
forth NASA’s strategy for returning the Space 
Shuttle safely to flight status. Their orders rescinded 
all Criticality 1, 1R, and 1S waivers and required 
that they be resubmitted for approval. The process 
also required the reevaluation of all FMEA/CILs 
and retention rationales, as well as hazard analyses. 
Other instructions required that a contractor be 
selected for each STS element (that contractor not 
otherwise being involved in work on the element) 
to conduct an independent FMEA/CIL. No specific 
guidelines were issued by the NSTS Office for the 
conduct of the independent evaluations; the meth- 
ods to be used were determined by the NASA 
centers concerned. Also, the FMEA/CIL reevalua- 
tions were initiated using pre-51L FMEA/CIL in- 
structions, in which there were differences in ground 
rules between JSC and MSFC. (In October 1986, 
the NSTS Program Office issued new uniform 
instructions, NSTS 22206, for the preparation of 
FMEA/CILs, but it took several months for revised 
directions to reach the STS contractors.) Thus, 
some differences emerged in the nature and results 
of the reevaluation conducted by different con- 
tractors. 

These differences are especially noticeable with 
respect to the FMEA/CIL reevaluation procedures. 
The Committee found that, at MSFC, all contrac- 
tors had been instructed to conduct a new FMEA, 
“from scratch.” At JSC, the independent contrac- 
tors were told to prepare a new FMEA, but the 
prime contractors were instructed to reevaluate the 
existing FMEA. At KSC, where FMEAs are con- 
ducted only on ground support equipment, a single 
group (not the original designer) was reevaluating 
each category of FMEA, working with the existing 
FMEA. Procedures with respect to the independent 
reviews also differed. At MSFC, the independent 
contractor first performed its FMEA and developed 
any necessary retention rationales; it then com- 
pared those results with the FMEAs and retention 
rationales prepared by the prime contractor and 
wrote specific Review Item Discrepancies (RIDs) 
on points of difference or disagreement. At JSC, 
no RIDs were written and no retention rationales 
were prepared by the independent contractor. Fur- 
thermore, some Orbiter subsystems were initially 
excluded from the review. 

Initially, the Committee was concerned that these 
differences in procedure might reduce the validity 
and effectiveness of the FMEA/CIL reevaluation 
process. However, an audit by the Committee of 
the documentation and review process used by JSC 
in the case of the Orbiter indicated that it is a 
reasonable alternative to the RID process employed 
by MSFC. Nevertheless, the Committee suggested 
in its second interim report to NASA (see Appendix 
C) that the NSTS Program Office “review the 
FMEA/CIL reevaluation processes as implemented 
for each STS element to assure itself that any 
differences will not compromise the quality and 
completeness of the overall STS FMEA/CIL effort.” 

This more specific concern for procedural dif- 
ferences led, moreover, to a broader concern over 
the nature of management control within NASA. 
Differences in procedures used by the NASA centers 
in this context and others (e.g., with respect to the 
independence of STS certification, as discussed in 
Section 5.8) lead the Committee to suspect that an 
imbalance may exist between the authority of the 
centers and that of the NSTS Program Office. The 
Committee is concerned that such an imbalance 
can lead to serious problems in large programs 
where two or more centers have major roles in 
what must be a tightly integrated program, such 
as the NSTS and Space Station. Without strong, 
central program direction and integration, the 
success and safety of these complex programs can be 
placed in jeopardy. 

Recommendation (10d): 

The Administrator should ensure that strong, 
central program direction and integration of all 
aspects of the STS are maintained via the NSTS 
Program Office. 


5.10.5 Use of Non-Destructive Evaluation Techniques 


Non-destructive evaluation (NDE) tests on 
the Solid Rocket Motor (SRM) are performed 
at the manufacturing plant. Subsequent trans- 
portation and assembly introduce a risk of 
debonding and other damage which may not 
be apparent upon visual inspection. No NDE 
is done on the SRMs in the “stacked” config- 
uration at the launch facility. 

New NDE techniques now being developed 
have potential applicability to the STS. 

Problems have been detected by NASA and its 
contractor on the STS Solid Rocket Motor (SRM) 
with debonding between the propellant, liner, in- 
sulation, and case. In April 1986, a USAF Titan 
34D (comparable in design to the SRM) experi- 
enced a destructive failure shortly after launch, due 
to debonding. No such severe consequences have 
been seen from SRM debonding, but bond line 
problems are nevertheless viewed as critical failure 
modes, especially given the redesign of the SRM 
joints. Voids within the propellant mass are also 
of concern. Destructive inspection of the SRM (e.g., 
cutting and probing) is not feasible, so non-destruc- 
tive methods must be used. On the SRM, most of 
these tests are performed at the manufacturing 
plant; later transportation and assembly introduce 
a risk of debonding and other damage which may 
be more difficult to detect at the launch site. 

There are essentially two issues here: the tech- 
niques employed and the location where inspection 
is done. Shuttle SRM NDE assessment to date has 
employed a combination of visual, ultrasonic, and 
radiographic techniques. The range of NDE tech- 
niques considered by NASA (but not necessarily 
tested) as of January 1987 is shown in Table 5-1. 
According to NASA’s Aerospace Safety Advisory 
Panel, acoustic and thermographic techniques are 
thought to be those with the greatest near-term 
potential for improving NDE capabilities with 
respect to the SRM.¹² Another promising group of 
techniques is based on X-ray technology. The 
USAF, in its Titan recovery program, has empha- 
sized NDE techniques including ultrasonic, ther- 
mographic, and X-ray.¹³ Similar efforts are being 
pursued in the Navy's Trident program.¹⁴ 

TABLE 5-1 Non-Destructive Evaluation Methods Considered By NASA 

Method                 Looks For                            Remarks

Ultrasonics            Unbonds: case/insulation,            Propellant/liner to be
                       inhibitor/propellant, and            confirmed
                       propellant/liner

Radial radiography     Propellant voids/inclusions

Tangential             Gapped unbonds: propellant/liner,
radiography            flap bonds, and flap bulb
                       configuration

Thermography           Unbonds: case/insulation,            Limited experience base;
                       inhibitor/propellant, and            prop./liner to be confirmed
                       propellant/liner

Mechanical             Unbonds: near joint end              Complex insulation geometry
                       case/insulation

Oblique-light video    Gapped edge unbonds:                 Magnifies and automates
                       case/insulation and                  visual unbond inspection
                       inhibitor/propellant

Computed tomography    Gapped unbonds: all intersecting     Long term
                       interfaces, propellant
                       voids/inclusions

Holography             Unbonds: near joint end              Excitation and scale concerns
                       case/insulation

Acoustic emission      Unbonds: case/insulation             Long term

(Source: NASA MSFC) 

With respect to the issue of location, NASA has 
determined that the “stacked” configuration of the 
SRM is not amenable to NDE of critical areas 
using available methods. However, NASA engi- 
neers believe that the assembly, rollout, and pad 
hold-down loads on the SRM will not cause de- 
bonding. Therefore, inspections are conducted at 
key processing points in the plant and at critical 
SRM segment locations before stacking at Kennedy 
Space Center. Nevertheless, the Committee remains 
concerned about the possibility of damage resulting 
from transportation, assembly, and rollout. 

We recognize that NASA is (and has been) paying 
serious attention to the NDE issue. However, we 
believe that the technologies are developing rapidly 
enough that continued close attention is warranted. 

Recommendation (10e): 

The Committee recommends that NASA apply 
all practicable NDE techniques to the SRM at the 
launch facility, at the highest possible level of 
assembly (e.g., SRMs in the “stacked” configuration), 
and emphasize development of improved 
NDE methods. 

¹² NASA: Aerospace Safety Advisory Panel, Annual Report for 1986 
(February 1987). 

¹³ Lt. Col. Frank Gayer, USAF Space Division, personal communication. 

¹⁴ Dale Kenemuth, SP-273, Dept. of the Navy, personal communication. 

5.11 FOCUS ON RISK MANAGEMENT 

The current safety assessment processes used 
by NASA do not establish objectively the levels 
of the various risks associated with the failure 
modes and hazards. 

It is not reasonable to expect that NASA 
management or its panels and boards can 
provide their own detailed assessments of the 
risks associated with failure modes and haz- 
ards presented to them for acceptance. 

Validation and certification test programs 
are not planned or evaluated as quantitative 
inputs to safety risk assessments. Neither are 
operating conditions and environmental con- 
straints which may control the safety risks 
adequately defined and evaluated. 

In the Committee's view, the lack of objec- 
tive, measurable assessments in the above areas 
hinders the implementation of an effective risk 
management program, including the reduction 
or elimination of risks. 


Throughout its audit the Committee was shown 
an extensive amount of information related to 
program flow charts, organizations, review panels 
and boards, information transmission, and reports. 
But the Committee did not become aware of an 
organization and safety-engineering methodology 
that could effectively provide an objective assess- 
ment of risk, as described in Section 4. Throughout 
the flow of NASA reports and approvals, both 
before the 51-L mission and after, judgments are 
made and statements of assurance given by persons 
at every level which are based on data and assertions 
having a wide range of validity. The Committee 
believes that it is not reasonable to expect program 
management or NASA Level I management to 
provide its own in-depth evaluation of presented 
hazard risks. Nor will other panels or boards be 
able to do so without the necessary professional 
staff work being done. That work, in turn, cannot 
be performed without methods for assessing risk 
and controlling hazards. The methods must include 
the establishment of criteria for design margins 
which are consistent with the acceptable levels of 
risk. 

The Associate Administrator for SRM&QA, in 
his new plan for management of NASA’s SR&QA 
activities, stipulates that the SR&QA directors of 
the NASA centers are responsible for assuring the 
safety of their Center’s products and services. 
However, we conclude that unless the safety or- 
ganizations at the centers have (1) the appropriate 
methodology and tools (both analysis programs 
and personnel), and (2) the authority to establish 
criteria for safety margins, specific requirements on 
verification test programs, environmental con- 
straints on operations, and total flight configuration 
validation, they cannot be held responsible for 
assuring an acceptable level of safety of flight 
systems. (In fact, they can never “assure safety,” 
but only assure that the risks have been assessed 
objectively by approved methodologies, and that 
they are being controlled to the levels accepted by 
the appropriate NASA authorities.) 

Figure 5-12 shows that even in the current post-51-L 
planning, the final result of the hazard analysis 
and safety assessment process is a NASA Space 
Shuttle Hazards Data Base. Having an approved 
list of accepted, identified hazards and a sophisti- 
cated closed-loop accounting and review system 
(the SIAP) may be useful. However, nearly every 
catastrophic accident since the beginning of the 
missile and space programs was caused by some 
already-identified hazard related to potential failure 
modes. The essence of safety-risk management, in 
the Committee’s view, is not just the identification 
and acceptance of potential hazards, nor even the 
performance of a risk assessment for each failure 
mode and hazard; it is getting control of the 
conditions which turn potential into real. The 
FMEAs, CILs, hazard reports, and safety assess- 
ments identify risks, summarize information, 
reference data, provide status, etc. They do not analyze 
or establish the risk levels. Neither do they assess 
quantitatively the validity of the test programs in 
establishing failure margins, or define the operating 
conditions or environmental constraints which af- 
fect the risk levels. 

We believe that the key requirements and con- 
cepts contained in various relevant NASA docu- 
ments (see Section 3, for example) provide a good 
overall framework within which a comprehensive 
systems safety and risk management program could 
be defined and implemented. It is the opinion of 
the Committee that such a program would require 
bringing together appropriate activities into a fo- 
cused “Systems Safety Engineering” (SSE) function 
at both Headquarters and the centers. This SSE 
function would apply across the entire set of design, 
development, qualification and certification, and 
operations activities of the NSTS. These activities 
would be an integral engineering element of the 
NSTS Program. They would involve more than just 
the preparation of reviews, reports, or data pack- 
ages. Instead, systems safety engineering would 
combine the functions of reliability and systems 
safety analysis. It should be responsible for defining 
the requirements and procedures, and performing 
or managing, as appropriate, at least the following 
functions which comprise the basis of a risk as- 
sessment and risk management system: 

1. Identification of failure modes and effects 

2. Establishment of design criteria for redun- 
dancy 

3. Identification of hazards and their potential 
consequences 

4. Identification of critical items 

5. Evaluation of the probability of occurrence 
of causes and consequences of failure modes 
and hazards 

6. Establishment of safety-risk level criteria for 
design margins and hazard controls 

7. Design of qualification and certification test 
programs 

8. Objective assessment of safety risks 

9. Development of acceptance rationale for 
retained hazards and hazard reports 

10. Specification of environmental and operating 
constraints at all levels (parts, subsystem, 
element, and system) to assure that validated 
margins are not violated 

11. Quantitative evaluation of flight data to 
update safety margin validations 

12. Oversight of quality assurance functions to 
control safety risks 

13. Overall system safety risk assessment and 
definition of the potential to reduce the level 
of risk. 

FIGURE 5-12 NASA NSTS safety analysis, Hazard Reports, and safety assessment process in 1987 (NASA JSC SR&QA). 

All of the above systems safety engineering func- 
tions (elaborated upon in Appendix F) are necessary 
both for achieving credible risk assessment and for 


76 























defining the risk controls required to justify ac- 
ceptance of critical failure modes and other hazards. 
During design and development, the quantitative 
evaluation of relative risks for each design against 
acceptable criteria for levels of risk should be 
considered as an integral part of the systems en- 
gineering activity. These activities also would pro- 
vide a definitive basis for establishing the design 
margins and operational constraints needed to 
reduce the overall risk to the accepted level and 
subsequently control the risk. 
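
A minimal sketch of what a quantitative evaluation of relative risks against an acceptance criterion might look like is given below. It assumes, purely for illustration, that risk is expressed as probability of occurrence multiplied by a relative severity weight; the hazard names, probabilities, weights, and the criterion itself are hypothetical and do not come from any NASA data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    probability: float  # assumed probability of occurrence per flight
    severity: float     # assumed relative consequence weight (1.0 = worst case)

    @property
    def risk(self):
        return self.probability * self.severity


def rank_against_criterion(hazards, acceptable_risk):
    """Sort hazards by risk, highest first, and flag those above the criterion."""
    ranked = sorted(hazards, key=lambda h: h.risk, reverse=True)
    return [(h.name, h.risk, h.risk > acceptable_risk) for h in ranked]


# Hypothetical figures, for illustration only.
hazards = [
    Hazard("hazard A", 1e-3, 1.0),
    Hazard("hazard B", 1e-2, 0.05),
    Hazard("hazard C", 1e-5, 1.0),
]
ranking = rank_against_criterion(hazards, acceptable_risk=2e-4)
for name, risk, exceeds in ranking:
    print(f"{name}: risk={risk:.1e} exceeds_criterion={exceeds}")
```

Items flagged as exceeding the criterion would be the candidates for added design margin or operational constraints; the ranking also gives management a prioritized view of the contributors to overall risk.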

Function 13 above (definition of the potential to 
reduce the level of risk) is an essential input to risk 
management. The Committee has the impression 
that changes to the STS often are considered only 
if they will improve its performance or reduce risks 
to that level which has previously been accepted 
in the program. The Committee believes that such 
risks, accepted in the past, logical as that may have 
appeared to be at the time, should not continue to 
be accepted without a concentrated effort to plan 
and implement a program to remove or reduce 
these risks. 

The magnitude of the preceding tasks points to 
the need for a large number of highly qualified 
professional systems safety engineers (i.e., systems 
engineers with a safety orientation) at NASA and 
at its major contractors. We were disturbed to 
learn from the Director of the Safety Division at 
Headquarters SRM&QA that, as of April 25, 1987, 
he had only one professional systems safety engi- 
neer in his division, and that he expects to add 
only two more in the near term and four additional 
ones in the long term. It is troubling to the 
Committee that this important and extremely com- 
plex systems engineering function should be so 
severely constrained by staff limitations, in light of 
the cost of the Shuttle and the risk to its crew. 

Taken together, the tasks listed above have the 
highest leverage on overall risk assessment and the 
control of the causes of hazard. Only professionally 
dedicated systems safety engineers working to- 
gether can develop the expertise and motivation to 
carry out these functions properly. They can per- 
form their control of validation and certification 
programs in an objective way (if not functionally 
assigned to program organizations). The need for 
independent entities to perform certification and 
software IV&V to provide substantiation and con- 
fidence was discussed in Section 5.8. This risk- 
managed approach to the validation and certifi- 
cation functions, including the feedback of flight 
data, should not be done by those responsible for 
design and development. They are performance 
oriented; they generally do not design hardware 
configurations to facilitate margin validation, and 
their proposed certification programs usually are 
not oriented to the demonstration of failure mar- 
gins. 

Finally, it seems to the Committee that it is not 
managerially reasonable to make an organization 
responsible for holding system safety to an agreed 
level of risk without according it responsibility and 
authority over all of the above functions, which 
actually control the risks. 

Another major element of an overall risk man- 
agement program is the quality assurance (QA) 
function. Quality assurance certifies that the hard- 
ware and software have been produced to the exact 
designs which describe the validated and qualified 
system. The “configuration” includes all aspects of 
the hardware and software, including the environ- 
ments which in any way influence the properties 
of materials, stress margins, or temporal behavior 
of parts, subsystems, and elements. 

In 1986, responsibility for policy and oversight 
of the quality assurance function was assigned to 
the new office of the Associate Administrator for 
SRM&QA. This is appropriate, because overall 
risk management and total systems safety are 
dependent on the quality assurance function 
throughout NASA. The QA function should be 
performed separately from the systems safety en- 
gineering functions (although there is certainly a 
strong oversight interaction between the two). 
Quality assurance should be a responsibility of 
each NASA center (and, of course, each contractor). 
Its purpose is not to design but to control and 
assure. As part of this function it should control 
the entire set of final released engineering docu- 
ments describing the complete configuration of the 
system. As the Committee understands it, that is 
precisely NASA’s current practice. 

Recommendations (11): 

The Committee recommends that NASA con- 
sider establishing a focused agency-wide Systems 
Safety Engineering (SSE) function, at both Head- 
quarters and the centers, which would: 

— be structured so as to be integrally involved in 
the entire set of design, development, validation, 
qualification, and certification activities; 

— provide a full systems approach to the continuous 
identification of safety risks (not just failure 
modes and hazards) and the objective (quanti- 
tative) evaluation of such safety risks; 

— provide the output of this function to the NASA 
Program Directors in support of their risk man- 
agement; 

— support the Program Directors by providing 
assurance that their systems are ready for final 
safety certification to the risk levels established 
by the NASA Administrator. 

The Committee also recommends that the STS 
risk management program, based in part on the 
definition of the potential to reduce the level of 
risk developed by the system safety risk assessment, 
include a concerted effort to remove or reduce the 
risks. 

6 Lessons Learned 


Although this report and its recommendations 
are directed to the NSTS Program, they are of 
broader applicability. It would be wise to consider 
the lessons learned by the Committee when struc- 
turing a risk assessment and management system 
for other programs with similar characteristics, 
such as the Space Station Program. These charac- 
teristics would include large size, use of highly 
complex technology, and major participation by 
several NASA centers and prime contractors. The 
following are generalized conclusions derived from 
the preceding sections. Numbers in parentheses 
refer to the principal sections of the report from 
which the conclusions were derived. 

6.1 ELEMENTS OF AND 
RESPONSIBILITIES FOR RISK 
ASSESSMENT AND RISK MANAGEMENT 

In the Committee’s view, any large, complex, 
multi-center program should entail an overall risk 
assessment and risk management process which 
includes the following basic elements: 

Risk assessment: 

— A comprehensive method for identifying po- 
tential failure modes and hazards associated with 
the system. 

— A specific, quantitative methodology for iden- 
tifying and assessing (or estimating) the safety risks 
of the system. 

Risk management: 

— A management process by which the safety 
risks can be brought to levels or values that are 
acceptable to the final approval authority. Risk 
management includes establishment of acceptable 
risk levels; the institution of changes in system 
design or operational methods to achieve such risk 
levels; system validation and certification; and 
system quality assurance. (4.1) 

The Committee believes that risk management 
must be the responsibility of line management (i.e., 
the program manager and, ultimately, the Admin- 
istrator of NASA). Only this program management, 
not the safety organizations, can make judicious 
use of the means available to achieve the opera- 
tional goals while reducing the safety risks to 
acceptable levels. The safety organizations at NASA 
centers and Headquarters are staff organizations — 
i.e., they can and should be responsible for provid- 
ing the assessments of a system’s risks. They should 
also be responsible for assuring that the activities 
associated with controlling the risks to the levels 
assessed have been carried out and documented. 
Safety organizations cannot, however, assure safe 
operation; they can only assure that the safety risks 
have been properly evaluated, and that the system 
configuration and operation are being controlled to 
those risk levels which have been accepted by top 
management. (4.1, 4.3) 

In each such major program, the risk assessment 
and management processes should be supported 
by a focused agency-wide Systems Safety Engi- 
neering function, at both Headquarters and the 
centers involved in the program, which would: 

— be structured so as to be integrally involved 
in the entire set of design, development, validation, 
and qualification activities; 

— provide a full systems approach to the contin- 
uous identification of safety risks (not just failure 
modes and hazards) and the objective (quantitative) 
evaluation of such safety risks; 

— provide the output of this function to the 
program director in support of his risk management 
process; 

— support the program director by providing 
assurance that his system is ready for final safety 
certification to the risk levels established by the 
NASA Administrator. (5.11) 

This focused systems safety engineering would 
combine the functions of reliability and systems 
safety analysis. It should be responsible for defining 
the requirements and procedures, and performing 
or managing, as appropriate, at least the following 
functions which should comprise the basis of a risk 
assessment and risk management system: 

1. Identification of failure modes and effects 

2. Establishment of design criteria for redun- 
dancy 

3. Identification of hazards and their potential 
consequences 

4. Identification of critical items 

5. Evaluation of the probability of occurrence 
of causes and consequences of failure modes 
and hazards 

6. Establishment of safety-risk level criteria for 
design margins and hazard controls 

7. Design of qualification and certification test 
programs 

8. Objective assessment of safety risks 

9. Development of acceptance rationale for 
retained hazards and hazard reports 

10. Specification of environmental and operat- 
ing constraints at all levels (parts, units, 
subsystem, element, and system) to assure 
that validated margins are not violated 

11. Quantitative evaluation of flight data to 
update safety margin validations 

12. Oversight of quality assurance functions to 
control safety risks 

13. Overall system safety risk assessment and 
definition of the potential to reduce the level 
of risk. 

All of these systems safety engineering functions 
(elaborated upon in Appendix F) are necessary 
both for achieving credible risk assessment and for 
defining the risk controls required to justify ac- 
ceptance of critical failure modes and other hazards. 
During design and development, the quantitative 
evaluation of relative risks for each design against 
acceptable criteria for levels of risk should be 
considered as an integral part of the systems en- 
gineering activity. Finally, these activities would 
provide a definitive basis for establishing the design 
margins and operational constraints needed to 
reduce the overall risk to the accepted level and 
subsequently to control the risk. They also can 
provide a rational basis for decisions on which 
risks should be reduced through changes in design 
or procedures. (5.11) 

In controlling risks, there must be a formal, 
continuing, and iterative linkage between the risk 
assessment and risk management processes, on the 
one hand, and the system’s engineering change 
activities, on the other. (5.4) 

As a program moves toward its operational 
phase, a system should be established for the rapid 
and effective feedback of inspection and test results, 
and repair and flight data into the risk assessment, 
risk management, and decision making processes. 
In the case of flight programs, this should include 
ensuring that all mission anomalies detected in real 
time and from recorded events, as well as those 
detected during the near-term inspection of any 
recovered hardware, are promptly fed into the 
formal risk assessment and management processes 
for action prior to committing to the next flight; 
all such anomalies should be called to the immediate 
attention of launch decision makers. (5.5) 

6.2 ESTABLISHMENT OF 
RESPONSIBILITY FOR PROGRAM 
DIRECTION AND INTEGRATION 

An imbalance between the authority of the NASA 
centers and that of the Program Office could lead 
to serious problems in a large program where two 
or more centers have major roles in what must be 
a tightly integrated program, such as the STS and 
Space Station. Without strong, central direction 
and integration, the success and safety of these 
complex programs can be placed in jeopardy. The 
Administrator of NASA should ensure that strong 
direction and integration of all aspects of such a 
program are maintained at Level I via the Program 
Office. (5.10.4) There also must be clear and 
unambiguous direction of the program at all levels. 
Those responsible for decisions should be designated 
and known to all. Boards and panels should 
be advisory to these persons and not decision 
making bodies in themselves. (5.10.1) 

6.3 THE NEED FOR QUANTITATIVE 
MEASURES OF RELATIVE RISK 

Top management and program attention should 
be focused on those items with the greatest risk to 
the safety of a system by means of a prioritization 
of all contributors to the overall risk. (5.2) Ac- 
ceptable levels of risk in each program should be 
set by the Administrator of NASA. However, 
suitable quantitative measures of risk, such as 
probabilistic risk assessment, are required to ob- 
jectively define the acceptable levels, track progress 
toward achieving these levels, and evaluate alter- 
nate courses of action to reduce risk. (5.6, 5.11) 
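
The prioritization described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each contributor to risk can be given a per-flight failure probability and a severity weight, and that relative risk is their product; the item names, numbers, and function names below are hypothetical and are not drawn from any NASA procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalItem:
    name: str                   # hypothetical item label
    failure_probability: float  # assumed per-flight probability of the failure mode
    severity: float             # assumed consequence weight (1.0 = loss of vehicle)


def risk_of(item):
    """Relative risk of one item: probability times severity (an assumption)."""
    return item.failure_probability * item.severity


def rank_by_relative_risk(items):
    """Return (item, share_of_total_risk) pairs, highest risk first."""
    total = sum(risk_of(item) for item in items)
    ranked = sorted(items, key=risk_of, reverse=True)
    if total == 0:
        return [(item, 0.0) for item in ranked]
    return [(item, risk_of(item) / total) for item in ranked]


def within_accepted_level(items, accepted_level):
    """Compare the summed risk with the level accepted for the program."""
    return sum(risk_of(item) for item in items) <= accepted_level


# Illustrative values only.
EXAMPLE_ITEMS = [
    CriticalItem("item A", failure_probability=1e-4, severity=1.0),
    CriticalItem("item B", failure_probability=1e-3, severity=1.0),
    CriticalItem("item C", failure_probability=1e-2, severity=0.01),
]

if __name__ == "__main__":
    for item, share in rank_by_relative_risk(EXAMPLE_ITEMS):
        print(f"{item.name}: {share:.1%} of total estimated risk")
```

A ranking of this kind only orders the contributors; the accepted level against which the total is compared would still have to be set by management, as the text states.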

6.4 THE NEED FOR INTEGRATED REVIEW 
AND OVERVIEW IN THE ASSESSMENT OF 
RISK, AND IN INDEPENDENT EVALUATION 
OF RETENTION RATIONALES 

There should be an integrated review process 
which provides a comprehensive, overall assess- 
ment of risk (including an independent evaluation, 
constantly updated, of retention rationales) upon 
which to base any decisions to grant waivers which 
permit operating with items that appear on the 
Critical Items List. (5.1, 5.3, 5.11) A balance is 
needed between “bottom-up” assessment tools (e.g., 
FMEA/CIL) and “top-down” analyses (e.g., hazard 
analyses). In particular, the “top-down” analysis 
processes must encompass an integrated system- 
wide engineering analysis, including a system safety 
analysis. (5.7) 

6.5 INDEPENDENCE OF THE 
CERTIFICATION OF FLIGHT 
HARDWARE AND OF SOFTWARE 
VALIDATION AND VERIFICATION 

Responsibility for approval of hardware certifi- 
cation and software Independent Validation and 
Verification (IV&V) should be vested in entities 
separate from the program management structure 
and the centers directly involved in the program’s 
development and operation. However, the latter 
organizations should continue to conduct activities 
supporting certification and IV&V. (5.8) 


6.6 SAFETY MARGINS FOR FLIGHT 
STRUCTURES 

Safety margins for flight structures should be 
established which are in consonance with the ac- 
cepted levels of safety risk for the program. How- 
ever, great care is needed to properly verify that 
the margins have been achieved and are maintained 
in the flight structures. Verification can include the 
use of analytical models, but should be supported 
by static tests before flight, and — in the case of 
reusable flight hardware — continued monitoring in 
flight by permanently instrumenting, calibrating, 
and analyzing data from a representative flight 
system. Also, in the case of reusable hardware and 
man-rated systems destined to remain in orbit for 
long periods of time, comprehensive plans should 
be developed and implemented for conducting 
periodic inspection and maintenance of the struc- 
ture of each system throughout the service life of 
each vehicle or platform. (5.10.2) 

6.7 OTHER 

There are other important factors in risk assessment 
and management which have been discussed 
in this report with respect to the STS as it existed 
following the Challenger accident. However, they 
are items which are considered to be less important 
than those enumerated above or not generally 
applicable to several other programs. Where ap- 
plicable, they certainly should be given serious 
consideration in structuring the risk assessment and 
management program. These other factors are 
listed here by title and section reference: 
Operational Issues (5.9) 

— Launch Commit Criteria Waiver Policy (5.9.1) 
— Human Factors as a Contributor to Risk 
(5.9.2) 

— Cannibalization of Spare Parts (5.9.3) 

Other Weaknesses in Risk Assessment and Management 
(5.10) 

— Software Issues (5.10.3) 

— Use of Non-Destructive Evaluation (NDE) 
Techniques (5.10.5). 

For any new program, such as the Space Station, 
there is the opportunity to structure an optimum 
risk assessment and management program at the 
outset which builds on the experience gained in 
the NSTS Program and assembles those techniques 
which will be most effective in establishing, mon- 
itoring, and controlling risks to accepted levels. 
APPENDIX A 

ACRONYMS AND DEFINITIONS 


Acronyms: 


AFSIG       Ascent Flight Systems Integration Group
ALT         Approach and Landing Test
APU         Auxiliary Power Unit (in the Orbiter)
ASAP        Aerospace Safety Advisory Panel
BFS         Backup Flight System
CB          Control Board (generic)
CCB         Configuration Control Board
CCP         Configuration Control Panel
CDR         Critical Design Review
CFA         Critical Functions Assessment
CI          Configuration Inspection
CIL         Critical Items List
CIRA        Critical Item Risk Assessment
CR          Change Request
DCR         Design Certification Review
DER         Designated Engineering Representative (for the FAA)
DES         Data Exchange System
DR          Discrepancy Report
ERB         Engineering Review Board
EIFA        Element Interface Functional Analysis
EMF         Electromotive force
ET          External Tank
FAA         Federal Aviation Administration
FACI        First Article Configuration Inspection
FAP         Failure Analysis Program
FMEA        Failure Modes and Effects Analysis
FMEA/CIL    Failure Modes and Effects Analysis, and Critical Items List
FRR         Flight Readiness Review
GFE         Government Furnished Equipment
GPC         General Purpose Computer (on the Orbiter)
GSE         Ground Support Equipment
HA          Hazard Analysis
HPU         Hydraulic Power Unit (in the SRB)
HQ          Headquarters (of NASA)
HR          Hazard Report
IBM         International Business Machines
IHA         Integrated Hazard Analysis
IUS         Inertial Upper Stage
IV&V        Independent Validation and Verification
JSC         Johnson Space Center
KSC         Kennedy Space Center
LCC         Launch Commit Criteria
LLS         Launch and Landing Site
LRU         Line Replaceable Unit
LOV         Loss of Vehicle
MET         Mission Evaluation Team
MFG         Manufacturing
MICB        Mission Integration Control Board
MOD         Mission Operations Directorate (at JSC)
MPTA        Main Propulsion Test Article
MSA         Mission Safety Assessment
MSFC        Marshall Space Flight Center
MVGVT       Mated Vehicle Ground Vibration Test
NASA        National Aeronautics and Space Administration
NDE         Non-Destructive Evaluation
NHB         NASA Handbook
NMI         NASA Management Instruction
NPD         NASA Policy Directive
NRC         National Research Council
NSIS        NASA Safety Information System
NSTS        National Space Transportation System
OASCB       Orbiter Avionics Software Control Board
OMI         Operations and Maintenance Instructions
OMRS        Operations and Maintenance Requirements and Specifications
OMRSD       Operations and Maintenance Requirements and Specifications Document


PASS        Primary Avionics Software System
PCASS       Program Compliance Assurance Status System
PDR         Preliminary Design Review
PR          Problem Report
PRA         Probabilistic Risk Assessment
PRACA       Problem Reporting and Corrective Action (system)
PRCB        Program Requirements Control Board
QA          Quality Assurance
QRA         Quantitative Risk Assessment
QRM         Quantitative Risk Model
RID         Review Item Discrepancy (report)
RISD        Rockwell International, Space Division
RMPP        Risk Management Program Plan
SAIL        Shuttle Avionics Integration Laboratory
SASCB       Shuttle Avionics Software Control Board
SASR        Shuttle Avionics Systems Review
SCA         Shuttle Carrier Aircraft
SCAP        Shuttle Configuration Analysis Program
SCRHAAC     Shuttle Criticality Review and Hazard Analysis Audit Committee
SHIMS       Shuttle Hazard Information Management System
SIAP        System Integrity Assurance Program
SIMR        Systems Integration Management Review
SIR         Systems Integration Review (board)
SR&QA       Safety, Reliability, and Quality Assurance
SRB         Solid Rocket Booster
SRM         Solid Rocket Motor (of the SRB)
SRM&QA      Safety, Reliability, Maintainability, and Quality Assurance
SSE         Systems Safety Engineering
SSM         Subsystem Manager
SSME        Space Shuttle Main Engine
SSUS        Space Shuttle Upper Stage
STS         Space Transportation System
UCR         Unsatisfactory Condition Report
USAF        United States Air Force
VLS         Vandenberg Launch Site


Definitions:

Certification — consists of qualification tests, major ground tests, and other tests and/or analyses 
required to determine that the design of hardware from component through 
subsystem level meets requirements; a part of verification. 

Qualification — is used in terms of qualification tests (see certification), to establish that an item 
meets requirements. 

Validation — the confirmation of some state or condition determined earlier. 

Verification — the process of planning and implementing a program that determines that Shuttle 
systems meet all design, performance, and safety requirements. The verification 
process (for both hardware and software) includes all development, certification 
and acceptance testing, flight demonstration, appropriate pre-flight checkout, 
post-flight activities, and analyses necessary to support verification. 
APPENDIX B 

ESTABLISHING REPORTS AND DOCUMENTS 


The Shuttle Criticality Review and Hazard Analysis Audit Committee of the National Research Council 
held its opening meeting on September 22, 1986, in Washington, D.C. This appendix contains the 
following key references leading up to its establishment. 


Report of the Presidential Commission on the Space Shuttle Challenger Accident, William P. 
Rogers, Chairman, June 6, 1986. Excerpt: Vol. I, pp. 198-199, Recommendations: introduction 
and Recommendation III. 


Letter from the President of the United States to the Administrator of the National Aeronautics 
and Space Administration, June 13, 1986, directing that the recommendations of the Presidential 
Commission be implemented. 

Letter from the Administrator of NASA to the Chairman, National Research Council, July 3, 
1986, requesting the NRC to form an audit panel as called for in Recommendation III of the 
Presidential Commission. 


Letter from the Chairman of the National Research Council to the Administrator of NASA, 
July 15, 1986, agreeing to establish an audit panel under the National Research Council. 


Report to the President: Actions to Implement the Recommendations of The Presidential 
Commission on the Space Shuttle Challenger Accident, NASA, July 14, 1986, excerpt from 
p. 19. 


Statement of Task, Committee on Space Shuttle Criticality Review and Hazard Analysis Audit, 
November 12, 1986 (revision). 


Presidential Commission 
on the 
Space Shuttle Challenger Accident 


June 6, 1986 


Dear Mr. President: 

On behalf of the Commission, it is my privilege to present 
the report of the Presidential Commission on the Space Shuttle 
Challenger Accident. 

Since being sworn in on February 6, 1986, the Commission 
has been able to conduct a comprehensive investigation of the 
Challenger accident. This report documents our findings and 
makes recommendations for your consideration. 

Our objective has been not only to prevent any recurrence 
of the failure related to this accident, but to the extent pos- 
sible to reduce other risks in future flights. However, the 
Commission did not construe its mandate to require a detailed 
evaluation of the entire Shuttle system. It fully recognizes 
that the risk associated with space flight cannot be totally 
eliminated. 

Each member of the Commission shared the pain and anguish 
the nation felt at the loss of seven brave Americans in the 
Challenger accident on January 28, 1986. 

The nation’s task now is to move ahead to return to safe 
space flight and to its recognized position of leadership in 
space. There could be no more fitting tribute to the Challenger 
crew than to do so. 


Sincerely, 



William P. Rogers 
Chairman 


The President of the United States 
The White House 
Washington, D. C. 20500 


600 Maryland Avenue, S.W., Washington, D.C. 20024 (202) 453-1405 
EXCERPTS FROM: 


Report of the Presidential Commission on the 
Space Shuttle Challenger Accident 

William P. Rogers, Chairman 
June 6, 1986 

Pages 198-199 


Recommendations 


The Commission has conducted an extensive 
investigation of the Challenger accident 
to determine the probable cause and necessary 
corrective actions. Based on the findings and 
determinations of its investigation, the Commission 
has unanimously adopted recommendations to help 
assure the return to safe flight. 


The Commission urges that the Administrator 
of NASA submit, one year from now, a report 
to the President on the progress that NASA has 
made in effecting the Commission's recommendations 
set forth below: 


III 


Criticality Review and Hazard Analysis. 

NASA and the primary Shuttle contractors 
should review all Criticality 1, 1R, 2, and 2R 
items and hazard analyses. This review should 
identify those items that must be improved prior 
to flight to ensure mission success and flight safety. 
An Audit Panel, appointed by the National 
Research Council, should verify the adequacy of 
the effort and report directly to the Administrator 
of NASA. 
THE WHITE HOUSE 
WASHINGTON 

June 13, 1986 


Dear Jim: 

I have completed my review of the report from the Commission 
on the Space Shuttle CHALLENGER Accident. I believe that 
a program must be undertaken to implement its recommenda- 
tions as soon as possible. The procedural and organizational 
changes suggested in the report will be essential to resuming 
effective and efficient Space Transportation System operations, 
and will be crucial in restoring U.S. space launch activities 
to full operational status. 

Specifically, I would like NASA to report back to me in 
30 days on how and when the Commission's recommendations 
will be implemented. This report should include milestones 
by which progress in the implementation process can be 
measured. 

Let me emphasize, as I have so many times, that the men 
and women of NASA and the tasks they so ably perform are 
essential to the nation if we are to retain our leadership 
in the pursuit of technological and scientific progress. 

Despite misfortunes and setbacks, we are determined to press 
on in our space programs. Again, Jim, we turn to you for 
leadership. You and the NASA team have our support and 
our blessings to do what has to be done to make our space 
program safe, reliable, and a source of pride to our nation 
and of benefit to all mankind. 

I look forward to receiving your report on implementing the 
Commission's recommendations. 



The Honorable James C. Fletcher 
Administrator 
National Aeronautics and 
Space Administration 
Washington, D.C. 20546 



NASA 

National Aeronautics and 
Space Administration 

Washington, D.C. 
20546 

Office of the Administrator 


Dr. Frank Press 
Chairman 

National Research Council 
2101 Constitution Avenue 
Washington, DC 20418 


Dear Frank: 

On May 20, 1986, I wrote to you requesting that the National Research 
Council (NRC) form an oversight committee to review the work of NASA and our 
contractors in the necessary redesign, retest, and recertification of the 
Solid Rocket Motor (SRM). Your letter of June 2, 1986, provided NRC 
acceptance of this request, and the committee is now heavily involved in its 
work. I believe that a very effective relationship has been established among 
the parties involved. These actions are consistent with the first 
recommendation of the Presidential Commission on the Space Shuttle Challenger 
Accident. 

I must now, however, ask you for further assistance as we take the 
actions necessary to return the Shuttle to flight status. Recommendation III 
states that NASA and the primary Shuttle contractors should review all 
Criticality 1, 1R, 2, and 2R items and hazard analyses and that the review 
should identify those items that must be improved prior to flight to ensure 
mission success and flight safety. The Commission also recommends that "An 
audit panel appointed by the National Research Council should verify the 
adequacy of the effort and report directly to the Administrator of NASA." 

This letter is to request that the NRC form such an audit panel, verify the 
adequacy of the effort, and report to me. 

The review of these criticality items is under way within the STS program 
at this time and is anticipated to be completed in early 1987. The current 
review is being conducted at the individual project level with program level 
reviews scheduled to begin in the fall. A review of our approach by your 
panel would be most helpful prior to the beginning of the program level 
reviews. Subsequent plans for participation by the panel in the process and 
the reviews will be developed following this initial review. 

NASA will provide the audit panel with access to all information and 
technical data necessary to perform the functions of the review. Background 
and orientation briefings will be provided by NASA and appropriate contractor 
personnel to permit the panel to proceed with their assessment. Additional 
meetings and data exchanges with NASA and/or contractor personnel will be 
arranged as requested by the panel. 

The principal NASA contact during the course of the review will be 
Mr. Jay F. Honeycutt of the Office of Space Flight, telephone 453-1261. 

The expense of the work of the committee will be covered by an addition to 
NASW-3511. 
NATIONAL RESEARCH COUNCIL 


2101 CONSTITUTION AVENUE   WASHINGTON, D. C. 


July 15, 1986 


OFFICE OF THE CHAIRMAN 


The Honorable 
James C. Fletcher 
Administrator 

National Aeronautics and Space Administration 
Washington, D.C. 20546 

Dear Jim: 

I write in response to your letter of July 3, 1986, 
requesting that the National Research Council appoint 
an audit panel to review the NASA approach to resolving 
flight-critical items. The National Research Council 
will undertake this task, and will work to get started 
expeditiously. As you know, members of the NRC staff 
have already met with NASA headquarters management to 
discuss the scope of this effort. 

We will begin by having a one or two day scoping effort 
to better understand the NASA criticality review system 
as well as alternative review and evaluation procedures 
that are used in analogous situations. Upon conclusion 
of this first discussion, we should be ready to select a 
panel and proceed with the effort. 


Yours sincerely, 


Frank Press 
Chairman 


cc: Philip E. Culbertson 

Jay F. Honeycutt 


THE NATIONAL RESEARCH COUNCIL IS THE PRINCIPAL OPERATING AGENCY OF THE NATIONAL ACADEMY OF SCIENCES AND THE NATIONAL ACADEMY OF ENGINEERING 
TO SERVE GOVERNMENT AND OTHER ORGANIZATIONS. 
National Aeronautics and Space Administration 


Report to the President 


Actions to Implement 
the Recommendations 

of The Presidential Commission 

on the Space Shuttle Challenger Accident 


EXCERPT FROM PAGE 19: 

The Commission recommended that the 
National Research Council (NRC) appoint 
an Audit Panel to verify the adequacy of 
this effort and report directly to the Admin- 
istrator of NASA. This request has been 
made by NASA and accepted by the NRC. 
The NRC is forming the panel and NASA 
will support them as required. 


July 14, 1986 
Washington, D.C. 
Code Designator for Group: 


Commission on Engineering and 

Technical Systems 

ASSEMBLY OR COMMISSION 


Committee on Space Shuttle Criticality 
Review and Hazard Analysis Audit 
COMMITTEE 


Aeronautics and Space Engineering Board 

DIVISION, OFFICE OR BOARD SUB-UNIT 


STATEMENT OF TASK 


(Make clear what is expected of the group described and by whom the project 
is sponsored. Limit to not more than this page.) 


As recommended in the report of the Presidential Commission on the Space 
Shuttle Challenger Accident, the Committee will audit the review by NASA and 
its primary Shuttle contractors leading to the identification by NASA of 
those items that must be improved prior to resumption of flight to ensure 
mission success and flight safety. Particular attention will be given to the 
Failure Modes and Effects Analyses (FMEA), Critical Item Lists (CIL), and 
Hazard Analyses. The audit will concentrate on procedures, techniques, and a 
sampling of specific actions taken by NASA and the contractors in order to 
verify the adequacy of the effort. The results of the audit will be reported 
directly to the Administrator of NASA by a series of letter reports and a 
final report. 

The Executive Committee of the Governing Board of the National Research 
Council approved this effort at its meeting on August 26, 1986. 

The work of the Committee is carried out under Contract No. NASW-3511 
with the National Aeronautics and Space Administration. 


November 14, 1986 
Date of Statement 


September 5, 1986 

(Date of previous statement if applicable) 


COMMITTEE RECORDS FORM #1 
APPENDIX C 


LETTER REPORTS TO THE ADMINISTRATOR OF NASA 
AND NASA RESPONSE 

Prior to this final report, the Shuttle Criticality Review and Hazard Analysis Audit Committee issued 
two interim letter reports to the Administrator of the National Aeronautics and Space Administration. 
The Administrator of NASA provided a response to the Committee regarding the first interim report. It 
also was referenced in NASA's Report to the President of June 1987. These documents are contained in 
this appendix. 


First interim letter report to the Administrator of NASA from Committee Chairman Alton D. 
Slay, January 13, 1987, 4 pp. 

Reply to Committee Chairman Alton D. Slay from the Administrator of NASA regarding the 
first report, April 22, 1987 

Report to the President: Implementation of the Recommendations of The Presidential Commission 
on the Space Shuttle Challenger Accident, NASA, June 1987, excerpts from pp. 41-42 

Second interim letter report to the Administrator of NASA from Committee Chairman Alton 
D. Slay, July 22, 1987, 8 pp. 
NATIONAL RESEARCH COUNCIL 

COMMISSION ON ENGINEERING AND TECHNICAL SYSTEMS 

2101 Constitution Avenue Washington, D.C. 20418 

AERONAUTICS AND SPACE 
ENGINEERING BOARD 

January 13, 1987 


The Honorable James C. Fletcher 
Administrator 

National Aeronautics and Space Administration 
Washington, D.C. 20546 

Dear Jim: 

This is an interim progress report of the Shuttle Criticality Review 
and Hazard Analysis Audit Committee. The National Research Council 
formed this committee in response to your request for an audit of the 
NASA response to the Presidential Commission Recommendation III 
regarding criticality review and hazard analysis. 

The Committee has been a functioning entity since its first meeting on 
September 22, 1986. We have thus far received presentations from and 
engaged in detailed discussions with NASA Headquarters, the National 
Space Transportation System program office, Johnson Space Center, 
Marshall Space Flight Center, and Kennedy Space Center. Similar 
meetings were held at Rocketdyne (Space Shuttle Main Engine) and 
Rockwell International (Orbiter), and by a working group at Morton 
Thiokol (Solid Rocket Motor). All of the participants described their 
efforts and progress in reevaluating the Failure Modes and Effects 
Analysis (FMEA) and Critical Items List (CIL) status and in reassessing 
hazard analysis and risk management. The Committee also has 
received a briefing on and discussed the process being used by the 
U.S. Air Force Systems Command-Space Division to determine launch 
readiness and safety status. The Titan 34D Recovery Program was 
described as an example. 

The Committee has been favorably impressed by the dedicated effort and 
extremely beneficial results obtained thus far from the FMEA/CIL and 
hazard analysis processes. We are very appreciative of the frank and 
open manner in which NASA and contractor personnel have worked with 
the Committee. Our suggestions have been received in a very responsive 
manner in all quarters. We wish to commend Admiral Truly, Arnold 
Aldrich and the NASA Shuttle team involved in the FMEA/CIL-hazard 
analysis processes for the significant work they have performed so 
far. Although our general impressions are favorable, we do have some 
suggestions for improvement. In summary, they are: 

o Criticality 1 and 1R items should be assigned priorities 
based on the probability of occurrence. 

o Since many of the Criticality 1 and 1R items differ substan- 
tially in terms of the probability of failure, NASA should 
consider modifying the definition of critical items to 
account for these differences. 


o NASA should incorporate its present total system review procedures 
in an integrated systems assessment process coupled 
closely with the FMEA/CIL reevaluation now being undertaken. 

o Linkage between the STS engineering change activities and the 
FMEA/CIL-hazard analysis processes should be assured. 


SETTING PRIORITIES FOR CRITICALITY 1 AND 1R ITEMS 
suggested selection process, along with the rationale that produced 
the priority rating. The waiver decision authority for the remainder 
of the Criticality 1 and 1R items should be delegated to Levels II and 
perhaps III. 


DEFINITION OF CRITICALITY CATEGORIES 

The Committee notes that the dedicated response of the entire NASA 
organization and its contractors has produced a variety of items 
which, by precise definition, must be placed in the Criticality 1 or 
1R categories. Many of the items differ substantially from one 
another in terms of the probability of failure or malperformance and 
thus their potential impact on Shuttle operational safety. The 
Committee suggests that NASA consider a modification of the Critical 
Items List to account for these differences, help the priority 
selection process, and better focus present or future efforts to 
achieve safer Shuttle operations. 


INTEGRATED SPACE TRANSPORTATION SYSTEM ANALYSIS 

The Committee understands that various mechanisms are being used by 
NASA to examine total system operation, including propagation of fail- 
ure modes to interfacing or physically adjacent modules or subsystems. 
The Committee does not perceive, however, any formal relationship of 
such evaluation methods to the ongoing FMEA/CIL process. The Committee 
suggests that NASA devise an integrated STS systems assessment 
process which is closely coupled with the FMEA/CIL activity to assure 
assessment of the truly critical safety elements in the STS. This 
includes all combinations of hardware/software/procedural failures and 
cascading failures. 


RELATION BETWEEN FMEA/CIL-HAZARD ANALYSIS AND DESIGN CHANGES 

We note that many engineering changes have been undertaken since the 
51-L accident to improve Shuttle safety prior to resumption of flight, 
now scheduled for February 1988. In parallel, the FMEA/CIL and hazard 
analysis reevaluations are under way with completion expected during 
the summer of 1987. Thus, the FMEA/CIL reevaluation may not adequately 
reflect all of the engineering changes, nor will there be time to 
incorporate any substantial design changes that may be indicated by 
the outcome of the FMEA/CIL reevaluation, hazard analyses, and related 
activities. The Committee recommends that NASA assure a close linking 
between the STS engineering change activities and the FMEA/CIL-hazard 
analysis processes. 


NASA 

National Aeronautics and 
Space Administration 

Washington, D.C. 
20546 

Office of the Administrator 


APR 22 1987 


General Alton D. Slay 
National Research Council 
National Academy of Engineering 
2101 Constitution Avenue, NW (NAS 307) 
Washington, DC 20418 


Dear Al: 

In reply to your January 13, 1987, interim progress report of the 
Committee on Shuttle Criticality Review and Hazard Analysis, your four 
suggestions are repeated, along with NASA's response to each. 

NRC Comment: "Criticality 1 and 1R items should be assigned priorities 
based on the probability of occurrence." (This comment also suggested the use 
of probability analysis techniques and the delegation of certain criticality 
items to lower levels of the organization.) 

NASA Response: The National Space Transportation System is in the 
process of selecting and implementing a critical items prioritization 
technique for the Shuttle program. Five different techniques have been 
evaluated by review teams at JSC, MSFC, and KSC. One of these techniques has 
been selected to be presented to the program manager at a Program Requirements 
Control Board (PRCB) for baselining as a formal program requirement. The 
chosen approach will overlay the existing Failure Mode and Effects 
Analysis/Critical Items List (FMEA/CIL) activity with minimum perturbation, 
yet provide an effective measure of relative risk in order to focus future 
review emphasis and resource allocations. In parallel with the prioritization 
technique development, an effort is also under way to assess the utility of 
probabilistic risk assessment in the NSTS FMEA/CIL process. Activities have 
been initiated to engage two independent firms with expertise in probabilistic 
risk assessment to perform detailed reviews of the orbiter auxiliary power 
unit and the shuttle main propulsion pressurization system. A decision to 
apply such probabilistic risk assessment techniques to other elements of the 
Shuttle will depend upon assessments of the results and impacts of those 
efforts and comparison of these results with the results of the mainline 
FMEA/CIL activity. Delegating the review and approval of certain critical 
items will be decided after the results of the prioritization and risk 
assessment activities have been thoroughly assessed. 

NRC Comment: "Since many of the Criticality 1 and 1R items differ
substantially in terms of the probability of failure, NASA should consider
modifying the definition of critical items to account for these differences."

NASA Response: We expect the FMEA/CIL prioritization process will
provide the necessary definitions and program focus in this regard.

NRC Comment: "NASA should incorporate its present total system review
procedures in an integrated systems assessment process coupled closely with
the FMEA/CIL reevaluation now being undertaken."

NASA Response: Since the Challenger accident, NASA has reemphasized its
risk management effort. An important feature of the revised effort must be a
"systems engineering" approach that integrates the various elements of the
risk management process to assure assessment of the combinations of hardware,
software, procedures, and cascading failures. NASA’s new Associate
Administrator for Safety, Reliability, Maintainability and Quality Assurance
has been tasked to develop a new agencywide risk management system.

NRC Comment: "Linkage between the STS engineering change activities and
the FMEA/CIL hazard analysis processes should be assured."

NASA Response: Engineering changes are processed through the same Space
Shuttle configuration control boards that conduct the review of the
FMEA/CIL. A recent change to the procedure requires an assessment of each 
change request to determine if it affects any Criticality 1 or 2 hardware. 

The nature of the combined change control and FMEA/CIL processes is such that 
the total process cannot be completed until the last change to be implemented 
before flight has itself undergone a FMEA and been dispositioned by the
board. Regardless of the timetable established by the NSTS working schedule 
for FMEA/CIL preparation and review, the changes that result will be dealt 
with in the same manner as the generating FMEA items. All changes mandatory 
for first flight will undergo the same rigor, even if this results in a flight 
schedule impact. The NSTS Systems Design Reviews which began early last year 
have significantly reduced the likelihood of new changes being identified that 
have major schedule impacts. 

The dedication of your committee and the sincerity of its comments are 
very much appreciated by NASA. I hope you find our actions in response to 
your suggestions to be both appropriate and timely. Thank you again for your 
help.

Sincerely,

James C. Fletcher
Administrator



Report to the President

IMPLEMENTATION 

of the 

RECOMMENDATIONS 

of the Presidential Commission 
on the Space Shuttle 
Challenger Accident 


June 1987 






EIFA’s have been conducted on ET/ 
orbiter, SSME/orbiter, and SRB/ET/orbiter 
interfaces. These analyses have been
reviewed by NASA and the systems integra- 
tion contractor, and the results are under 
evaluation by the element project offices and 
the NSTS Engineering Integration Office. 
When this review is completed, the finalized 
EIFA’s will be presented to the PRCB for for-
mal approval.

NATIONAL RESEARCH
COUNCIL AUDIT

The Shuttle Criticality Review and Haz- 
ard Analysis Audit Committee of the 
National Research Council (NRC), chaired 
by retired USAF General Alton Slay, reports 
directly to the NASA Administrator and is 
responsible for verifying the adequacy of the 
proposed actions for returning the Space 
Shuttle to flight status (see Appendix F for 
panel membership and a summary of 
responsibilities). 

The committee has discussed the FMEA/ 
CIL/HA reevaluation process with repre- 
sentatives from NASA Headquarters, JSC, 
KSC, and MSFC. Meetings have been held 
at the centers and at Rockwell Internation- 
al's Space Transportation Systems and 
Rocketdyne divisions; Morton Thiokol; 
United Space Boosters, Inc.; Sundstrand 
Corporation; and NRC Headquarters. The 
committee is evaluating the adequacy of the 
review process, checking for continuity 
across all elements of the program, and 
reviewing changes that NASA and its con- 
tractors have made since the accident. 

A preliminary report was submitted to 
the NASA Administrator on January 13, 
1987, indicating that the committee has been 
favorably impressed with the results obtained 
from the FMEA/CIL and hazard analysis 
processes. While the committee’s general
impressions were favorable, it did make some 
suggestions for improvements. In summary, 
these suggestions are: (1) Criticality 1 and 1R
items should be assigned priorities based on
the probability of occurrence; (2) since many
of the Criticality 1 and 1R items differ sub-
stantially in terms of the probability of fail-
ure, NASA should consider modifying the
definition of critical items to account for
these differences; (3) NASA should incorpo- 
rate its present system review procedures into 
an integrated system assessment process 
coupled closely with the FMEA/CIL reevalu- 
ation now being undertaken; (4) linkage 
between the STS engineering change activi- 
ties and the FMEA/CIL/HA processes 
should be provided. 

NASA has responded to these sugges- 
tions in the following manner: 

1. Several candidate systems for prioritizing 
critical items have been evaluated by each 
of the projects. A hybrid system has been 
developed that incorporates the positive 
features of the candidate systems and spe- 
cifically addresses probability of occur- 
rence. The approach can be overlaid on 
the existing FMEA activity with mini- 
mum perturbation, providing an effective 
measure of relative risk. 

In parallel with the development of 
prioritization techniques, an effort is 
under way to determine the applicability 
of probability risk assessment to the 
FMEA/CIL process. This technique is 
used in the nuclear power industry to pro- 
vide relative-risk assessments. Two firms 
with expertise in probability analysis have 
been selected to perform detailed assess- 
ments of the orbiter auxiliary power unit 
and the main propulsion engine pressur- 
ization system. A decision to apply proba- 
bility analysis techniques to other systems 
of the program will depend on the results 
of these assessments. 

2. The FMEA/CIL prioritization process 
will provide the necessary program focus 
and more definitive definitions in 
response to the committee’s concern 
expressed in their second suggestion. 

3. Since the accident, NASA has reempha- 
sized its risk management effort. An 
important feature of the revised effort is a 
“systems engineering” approach that inte- 
grates the various elements of hardware 
and software failure analysis. Further dis- 
cussion of risk management is included in 
the response to Recommendation IV. 

4. Engineering changes are processed
through the same project and program
control boards that conduct and approve
the reviews of the FMEA/CIL. Each
change request will be assessed to deter-
mine if it affects any Criticality 1 or 2
hardware to ensure that the required link-
age is provided.

The NRC audit committee is reviewing 
additional areas to identify potential meth- 
ods of reducing risk. These include the design 
qualification and flight certification pro- 
cesses, launch commit criteria and waiver 
policy, and the generation, review, and 
approval of retention rationale for waivers to 
critical items. 

Also being reviewed are the overall 
safety, reliability, maintainability, and quality 
assurance program, the definition of struc-
tural analysis requirements, the establish-
ment and verification of analyses for margins
of safety, the risk management processes for 
software, and the processes for analyzing pay- 
load safety. 

Interim findings and recommendations 
from these reviews will be submitted to the 
NASA Administrator through letter reports, 
as required. The final report, anticipated in 
1987, will include an assessment of the proce- 
dures reviewed and recommendations for 
improving the Shuttle risk management sys- 
tem. As reports are received, any recommen- 
dations included will be reviewed by NASA 
and responses will be provided to NRC.



NATIONAL RESEARCH COUNCIL

COMMISSION ON ENGINEERING AND TECHNICAL SYSTEMS 

2101 Constitution Avenue Washington, D.C. 20418

AERONAUTICS AND SPACE 
ENGINEERING BOARD 

July 22, 1987 


The Honorable James C. Fletcher 
Administrator 

National Aeronautics and Space Administration 
Washington, D.C. 20546 


Dear Jim: 

I am pleased to provide this second interim progress report of the 
National Research Council’s Committee on Shuttle Criticality Review and 
Hazard Analysis Audit. I wish to thank you for your letter of April 22, 
1987, in which you summarized the steps that the National Aeronautics and 
Space Administration (NASA) is taking in response to the suggestions in 
our first report to you of January 13, 1987. The Committee is indeed 
gratified by the progress NASA is making in strengthening the Space 
Transportation System (STS) risk management program. We also appreciate 
the continued close collaboration with NASA and contractor personnel, and 
note the interest they show and their responsiveness to the Committee's 
suggestions. The purpose of this letter is to react to the actions of 
NASA taken in response to our first letter, and to comment on some
additional aspects of STS risk management. 

Since our last report, the full Committee has met six more times, 
including visits to Marshall Space Flight Center, Kennedy Space Center, 
again to Rocketdyne on the Space Shuttle Main Engine (SSME), and with
Rockwell Space Transportation System Division on STS integration. Working 
groups of the Committee also met at appropriate NASA centers and 
contractors to review the risk management aspects of the Solid Rocket 
Booster (SRB); Orbiter Auxiliary Power Unit (APU) and SRB Hydraulic Power
Unit (HPU); Shuttle structural analysis, margins and verification; Orbiter
nose wheel steering; software; and Space Shuttle Main Engine. This
continued audit has allowed the Committee to evaluate the changes NASA is 
making in the STS risk management processes and to identify some 
additional views which we thought would be useful to share with you in 
this interim report. 

Regarding the response of NASA to the first report, the Committee's 
reaction is, in summary: 

o The work under way to assign priorities to Criticality 1 and 1R 
items appears to be a significant step forward. We also are 
pleased to note the tests of Probabilistic Risk Assessment (PRA)
now being conducted. 

o The Committee looks forward to learning how the prioritization
process will be used to redefine the critical items by taking 
into account the differences in the probability of occurrence. 


The National Research Council is the principal operating agency of the National Academy of Sciences and the National Academy of Engineering 

to serve government and other organizations 



o We enthusiastically support the agency-wide risk management
system now being developed. However, we are still concerned with
the apparent lack of consideration of the STS as a single, 
complex system rather than a collection of subsystems. 

o The steps taken to link the engineering change control and the
Failure Modes and Effects Analyses/Critical Items List (FMEA/CIL)
processes are both appropriate and welcome. We are also 
reassured by your statement that the flight schedule will not be 
allowed to reduce the rigor with which the risk management tasks 
will be conducted. 

The Committee's continuing audit since our last interim report leads us to 
provide initial comments on the following topics: 

o Persons involved in the STS program frequently give the
impression that decisions are made collectively by panels,
boards, etc., rather than by the responsible individuals. We 
believe that the Administrator of NASA should periodically remind 
the NASA organization of the specific individuals responsible for 
final decisions based on the advice received from each advisory 
body. 

o The new System Integrity Assurance Program (SIAP), especially its
Program Compliance Assurance and Status System (PCASS), now being
implemented by the National Space Transportation System (NSTS) 
Program office, will be invaluable as a tool in support of STS 
risk management. The STS failures data base, when completed, can 
be of major importance in determining the probability that the 
worst case effect postulated in the FMEA will actually occur. 

o The progress being made in improvements to the SSME as a result 
of the FMEA/CIL reevaluation is very encouraging. 

o The changes being introduced in NASA Headquarters Safety, 
Reliability, Maintainability and Quality Assurance (SRM&QA)
appear to be well planned and in the right direction. However, 
we are concerned that it is not adequately staffed to cope with 
the demands placed upon it, and recognize that close 
collaboration with the centers and program offices is necessary 
to improve risk management in NASA. 

o A risk assessment report, based upon both the FMEA/CIL/retention
rationale and a comprehensive hazard and safety assessment,
should be the basis for the acceptance rationale in considering
waivers to fly Criticality 1 components.

o There appear to have been unexplained differences among the STS
elements in the approach to and the rigor of the FMEA/CIL
reevaluations. The methods being used should be reviewed to 
assure that any differences which exist will not compromise the
FMEA/CIL reevaluation process. 

o The panels and boards (Program Requirements Change Board, Flight 
Readiness Review, etc.) that advise key NASA decision makers are 
not adequately staffed with people skilled in the statistical 
sciences of data analysis, statistical inference, and 
probabilistic risk assessment; persons with such skills should be
added to provide improved support of the decision making process. 

o A greater effort is needed to plan for additional elimination or 
reduction of risks in the STS. 

Following is an elaboration on these topics. 


COMMENTS ON NASA RESPONSE 


Setting priorities for Criticality 1 and 1R items 

We are pleased to see the steps being taken to assign priorities to the 
critical items. The Committee notes that the technique proposed for 
implementation lends itself to the incorporation of quantitative measures 
of risk and probabilities of occurrence as these measures are developed. 
However, the Committee urges that care be taken to assure that
oversimplified but potentially inaccurate quantitative measures are not used.
We have been assured by a representative of the NSTS office that the 
prioritization process can be completed well before the next Shuttle 
launch, which we believe to be an important consideration. We look 
forward to learning how NASA plans to use the results of this process. I
can understand your desire to defer a decision to delegate from Level I of 
NASA the review and approval of waivers on certain critical items until 
you have assessed the results of the new prioritization and risk 
assessment processes. However, the Committee believes that before the 
next launch some method should be used to assure that NASA Level I gives 
special attention to the highest priority items identified through the 
prioritization process. 

The Committee is delighted to learn that NASA is testing the use of
Probabilistic Risk Assessment (PRA) on the APU and HPU, and the Shuttle
main propulsion pressurization system. We also are aware of the SSME
certification process assessment study being conducted at the Jet
Propulsion Laboratory, which includes a PRA of the SSME. The Committee
cautions NASA on its intention to evaluate PRA by comparing the results of
only two or three disparate tests of PRA with the results obtained earlier
by the FMEA/CIL process. The criterion should not only be whether a
significant new problem is identified by the PRA. The PRA test results
should be used by NASA to answer the questions: Would the PRA have helped

entities allow the different interests and skill groups to bring forward
their inputs, contribute their knowledge, and thus minimize the risk that 
a proposed action will negatively affect some aspect of the STS. We
presume that each of these entities recommends an action to an appropriate 
official, such as a project manager at Level III or the Deputy Director of 
the NSTS Program at Level II, who actually makes and takes responsibility 
for the decision. 

The Committee is concerned about a possible attitudinal problem regarding 
the decision process on the part of the NASA personnel engaged in it. 

When we ask a NASA manager about how a decision is made, often we are told 
that it is made by such-and-such a board. We are concerned that there may 
be a tendency for those involved in the multi-layered review and decision 
process to hide in the anonymity of panels and boards, and that each 
person who must sign off on an item may not be inclined to concentrate 
enough on his or her individual responsibility in light of the number of 
levels of group reviews involved in the decision process. The Committee 
recommends that the Administrator of NASA periodically remind all of the 
NASA organization of the specific individuals by name and position who are 
responsible for final decisions (and the organizational relationships 
among them) based on the advice coming from each panel and board. This 
would not detract from the important role played by all members of the 
panels and boards in providing advice to the decision maker. 


Potential of the Program Compliance Assurance Status System (PCASS)

The Committee is enthusiastic about the potential of the PCASS, which is 
being established as a major part of the new System Integrity Assurance 
Program (SIAP) of the NSTS. It should improve the quality of information
available to key decision makers (e.g., at Flight Readiness Reviews) by
providing in near real-time an integrated view of the status of problems 
with the STS, including trends, anomalies and deviations, assessments, and 
closure information. Plans to keep up to date and computerize the FMEA
will provide a very useful input to PCASS. The Committee also has learned 
of the data base maintained by the Johnson Space Center (JSC) SR&QA office 
which documents in one place the failures which have occurred on the 
Orbiter during ground testing and in flight. It is encouraging to note
that of those failures of components on the Orbiter categorized as 
Criticality 1 which have occurred during flight, none resulted in the 
worst-case effect postulated in the FMEA. These failure data can be very 
valuable in connection with the new CIL prioritization system in
establishing the probability that the postulated effects will actually 
occur, given the failure in flight. We understand that this, and similar 
data bases for the other STS elements, will be integrated into the PCASS. 
We believe that PCASS, as a real-time data base, has the potential to 
become a key element of the STS risk management, and thus its full and 
timely development should be encouraged and supported. The Committee 
recommends that this development be given a high priority and that the 
potential users of PCASS, including key decision makers, be involved 
closely now in its development.


Progress on the SSME as a result of the FMEA/CIL reevaluation


Based on its second visit to Rockwell International - Rocketdyne Division, 
the Committee is encouraged with the progress being made in improving the 
SSME as a result of the FMEA/CIL reevaluation. We also applaud the 
improvements in the test program which are designed to validate the 
reliability of the modified SSME before first flight. The SSME is one of 
the few cases in which the Committee has found that changes have been made 
as a result of the FMEA/CIL. In most other cases, the Committee observes 
that the initiation of changes has not originated with the FMEA/CIL 
process. 


NASA Headquarters Safety, Reliability, Maintainability and Quality Assurance


In April, the Committee received a comprehensive briefing regarding the 
status and plans for the NASA Headquarters SRM&QA program. We are 
encouraged by the progress that has been made. The Committee believes 
that the program is going in the right direction. We recognize the 
magnitude of the task ahead; however, the goals and the program plans 
developed so far appear to be sound. The Committee is concerned that
SRM&QA (at Headquarters and the centers) is not adequately staffed to cope 
with the demands being placed upon it, perhaps necessitating the 
additional use of contract personnel in order to carry out their functions 
before the launch of the next Shuttle. The Committee also believes that 
it will be particularly important to develop close collaboration with the 
NASA centers as well as other program offices in order to do those things 
which are needed to create a total risk management system augmenting the 
independent check and balance role of SRM&QA. 


Input to waiver decisions 

The Committee understands that FMEAs, CIL determinations, and their 
retention rationale are developed by the STS design and development 
people. The SRM&QA, operations and other relevant personnel contribute as 
appropriate. The FMEA/CIL and retention rationale so produced are among 
the inputs to the hazard analyses which are done by the safety people. In 
this case, design, development, operations and other relevant personnel 
contribute as appropriate. The outputs of these two processes (FMEA/CIL/
retention rationale on the one hand, and hazard analyses on the other) are
individually approved by the Program Requirements Control Board (PRCB).
However, the Committee is concerned that the FMEA/CILs with their
design-based retention rationale have become the only effective input to
levels II and I in their waiver decisions to accept the designs as safe 
enough to fly. 

The Committee recommends that the present design-based retention rationale
should be only one part of the rationale required to accept the hazards
which can result from each critical failure mode. The other part should
be the output of the hazard and safety assessments, including evaluations
of the probability that the hazardous conditions will actually develop and
the probability that these conditions will lead to a Criticality 1
consequence. A risk assessment report, embracing the design retention
rationale and the hazards/safety assessment, should provide the acceptance
rationale for consideration by Level II and I managers in reaching their
decisions on the granting of waivers. 
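
The two-part rationale recommended above can be illustrated with a small
numerical sketch. This is not a NASA or Committee method; it only assumes
that each critical failure mode carries three estimated probabilities
(the failure occurs, the hazardous condition develops given the failure,
and a Criticality 1 consequence follows given the hazard) and that they may
be chained by multiplication. The class name, function name, failure mode
names, and all numbers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One critical failure mode with assumed (hypothetical) probabilities."""
    name: str
    p_failure: float               # P(failure mode occurs on a flight)
    p_hazard_given_failure: float  # P(hazardous condition develops | failure)
    p_crit1_given_hazard: float    # P(Criticality 1 consequence | hazard)

    def p_crit1(self) -> float:
        """Chain the three probabilities into P(Criticality 1 consequence)."""
        return (self.p_failure
                * self.p_hazard_given_failure
                * self.p_crit1_given_hazard)


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest chained probability."""
    return sorted(modes, key=lambda m: m.p_crit1(), reverse=True)


# Hypothetical entries; names and numbers are placeholders, not STS data.
modes = [
    FailureMode("valve sticks open", 1e-3, 0.5, 0.1),
    FailureMode("seal leak", 1e-4, 1.0, 1.0),
]

for mode in rank_by_risk(modes):
    print(f"{mode.name}: {mode.p_crit1():.2e}")
```

In this sketch the less frequent failure ranks higher because nothing
mitigates its consequence, which is the kind of distinction a design-based
retention rationale alone would not expose.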


Differences in FMEA/CIL reevaluation process among STS elements 

In the Committee's audit of the reevaluation of the FMEA/CILs, a number of
differences were found in the process being used by different element 
project offices and contractors. In some cases, we were unable to 
ascertain the reasons for the observed differences. For example, the 
independent contractors evaluating the FMEA/CILs for the STS elements 
managed by the Marshall Space Flight Center are required to review all 
subsystems and to file a Review Item Discrepancy (RID) when they differ 
with the results of the element contractor’s analysis. On the other hand, 
the independent contractor for the Orbiter evaluation was not directed to 
review all parts of the Orbiter and does not file RIDs. We understand 
that JSC now has directed the contractor to review all subsystems in the 
Orbiter. An audit by the Committee of the documentation and review 
process used in the case of the Orbiter indicates that it is a reasonable
alternative to the RID process. Nevertheless, the Committee suggests that
the NSTS program office review the FMEA/CIL reevaluation processes as 
implemented for each STS element to assure itself that any differences 
will not compromise the quality and completeness of the STS FMEA/CIL 
effort as a whole. 


Expertise in Statistical Sciences 

The key technical decision makers in NASA operate as chairmen of bodies
that review relevant technical information. The decisions involve design, 
requirements, waivers, launch decisions, etc. Much of this information is 
in the form of complex engineering data, such as test, inspection, flight, 
and weather data. These bodies draw upon experts in many engineering 
disciplines to deal with the complexities. Indeed, it is important that 
there be close ties among the design engineers, test and analysis people, 
and decision makers throughout the process of designing, building, 
certifying, and using components and systems. However, the Committee 
finds that these bodies are not adequately supported by people skilled in 
the statistical sciences to aid in the transformation of complex data into 
information useful for decision making. 

The Committee recommends that NASA build up its staff of experts in the 
statistical sciences (civil servants and contract support) to provide 
improved analytical support of risk management and of key decision makers 
by the application of modern statistical analysis, inference and
assessment techniques.


Reducing the risk in the Space Transportation System

Even with the current FMEA/CIL and hazard analysis efforts which are 
supported thoroughly within NASA and by its contractors, the Committee 
receives the impression that changes often may only be considered which 
will reduce risks to that level which has been previously accepted in the 
STS program. The Committee believes that such risks, accepted in the 
past, logical as that may have appeared to be at the time, should not now
be accepted without a concentrated effort to plan and implement a program 
to remove or reduce these risks. 


FUTURE WORK 

The Committee is continuing its audit by examining other aspects of the 
STS risk management process. Among these are the design qualification and 
flight certification processes; a further look at integrated systems 
analysis; launch commit criteria and waiver policy; the process for
generating, reviewing, revising and approving the retention rationale for 
waivers to permit flight of the Shuttle with critical items that affect 
safety; the process for structural analysis, establishment of margins, and 
verification of analyses and margins; the risk management process for STS 
software; and the process for analyzing the effect of payloads on the
safety of the Shuttle, ground personnel, and flight crews. 

We plan to issue a final report of the Committee late this year. It will 
include our assessment of all of the procedures reviewed and recommenda- 
tions for improvement of the STS risk management system. If it should 
appear desirable, we will provide another interim letter report to convey 
findings and recommendations which may emerge from the reviews now under 
way. 


Sincerely yours, 


Alton D. Slay 
Chairman 

Committee on Shuttle Criticality 
Review and Hazard Analysis Audit 


cc: Admiral Richard H. Truly 



APPENDIX D 

PROBABILISTIC RISK ASSESSMENT 


1. THE APPROACH TO QUANTITATIVE 
RISK MANAGEMENT 

The output of a quantitative risk management 
function is a quantification and prioritization of 
issues, the controlling of which leads to optimal 
decisions involving safety, reliability, quality, per- 
formance, and cost. The approach is to implement 
a methodology that interprets, synthesizes, and 
integrates all elements of a product assurance 
program into a form suitable for decision making. 
The input would be the results from the various 
safety, reliability, and quality assurance programs 
of the field offices. The transformation of this 
information into a useful basis for decision making 
is the step that enables meaningful risk management 
to occur. 

The National Aeronautics and Space Adminis- 
tration (NASA) has a variety of documents covering 
the approach to be taken in the discipline areas of 
safety, reliability, maintainability, and quality as- 
surance. These documents, subject to revisions, 
would be the basic guides to be implemented by 
the various centers. It is the task of the risk 
assessment function to systematically process the 
output of the centers into a form suitable for 
meaningful risk management. The key require- 
ments for this critical information processing and 
assessment step are as follows: 

• The figures of merit must be explicit and 
quantitative. 

• The information processing must be based on 
an integrated systems engineering approach 
(see also Section 5.11). 

• The quantification of uncertainty must be an 
integral part of the information processing 
(see also Appendix E). 

• The contributors to risk must be explicit, 
prioritized, and defined in terms that enable 
measurable corrective actions. 

• Finally, the results should provide the basis 
for rational analysis of alternatives for reduc- 
ing and controlling risk. 
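
The requirement that uncertainty be quantified as part of the information
processing can be sketched with a simple example. The following assumes a
uniform prior on an event probability and a record of independent trials;
both assumptions, the function names, and the trial count are illustrative
only and are not drawn from this report. It shows that a record with no
observed events still leaves a nonzero estimate and a finite upper bound.

```python
def posterior_mean(events: int, trials: int) -> float:
    """Posterior mean of an event probability under a uniform Beta(1, 1) prior."""
    return (events + 1) / (trials + 2)


def upper_bound_zero_events(trials: int, credibility: float = 0.95) -> float:
    """Upper credible bound on the probability when no events were observed.

    With a uniform prior and zero events in `trials` trials, the posterior is
    Beta(1, trials + 1), whose CDF is 1 - (1 - p) ** (trials + 1). Solving
    CDF(p) = credibility for p gives the bound returned here.
    """
    return 1.0 - (1.0 - credibility) ** (1.0 / (trials + 1))


# Hypothetical record: 50 trials with no occurrence of the event of interest.
trials = 50
print(f"posterior mean: {posterior_mean(0, trials):.4f}")
print(f"95% upper bound: {upper_bound_zero_events(trials):.4f}")
```

The bound tightens as the number of trials grows, which is one way to
express how much additional evidence is needed before a figure of merit can
be stated with confidence.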

The logic engine for carrying out the information 
processing is a risk-based model of each space 


system. The model should be structured to give 
perspective to the importance of the various tasks 
associated with the product assurance activity. The 
model must be a living model with continuous 
input into and from the design process. While this 
approach probably is not warranted in many cases, 
such as small automated spacecraft, it should be 
considered in large, complex programs — especially 
those with potential risk to human life — such as 
the STS or the Space Station. 

2. TWO KINDS OF CONFIDENCE

The essential objective of the risk management 
effort is “confidence” — confidence that each space 
mission will perform substantially as planned, and 
confidence that it will not be destroyed or rendered 
significantly less useful by accidents or unforeseen 
problems (including excessive cost). Now, what is 
meant by confidence? One way we humans increase 
our confidence is to believe that we are highly 
competent. We shall call this “psychological” con- 
fidence. It can be extremely important for the 
effectiveness of an organization. NASA has done 
an excellent job in this area in the past, and this 
needs to continue. 

There is another kind of confidence that we shall 
call “engineering” confidence. This comes from in- 
depth understanding of the system under consid- 
eration, from deep knowledge of the design and 
testing program, and from knowing how to achieve 
quality in manufacturing, maintenance, operation, 
and flight readiness. 

There is another dimension to this notion of 
gaining engineering confidence. This comes from 
acknowledging that nothing ever built by man is 
100% reliable. It comes from knowing that risks 
are always present. The objective, therefore, is to 
know just how large the risk is. Thus, engineering 
confidence and success come not from eliminating 
risk, which is impossible, but from controlling it 
and managing it. That means knowing what it is — 
measuring it, knowing its size, shape, structure, 
etc. — and taking steps to reduce the risk to ac- 
ceptable levels. Thus, the idea of engineering con- 
fidence is essentially equivalent to the quantification 
of risk. This equivalence makes engineering confi-
dence an objective quantity, as distinct from psy-
chological confidence, which is subjective. Psycho- 
logical confidence is a matter of good feeling. 
Engineering confidence is objectively and logically 
related to the evidence available — to the informa- 
tion, experience, test data, calculations, and, in- 
deed, to the consensual judgments of the experts 
involved. Engineering confidence is the quantitative 
expression of that evidence. That expression is 
formulated according to strict, logical, invariable 
rules. It is not a matter of opinion or mood. 

When a satisfactory level of engineering confi- 
dence has been established, then those involved in 
the program indeed will have a “good feeling.” 
Therefore, engineering confidence produces psy- 
chological confidence. The reverse, as we know 
too well, is not necessarily true. 

3. HOW IS CONFIDENCE GAINED OR 
REGAINED? 

The public and Congress, based on past tech- 
nological failures in the nation’s space programs, 
are probably not going to be moved by psycholog- 
ical confidence in the future. Engineering confidence 
needs to be created. The issue of quantification 
needs to be faced. Those responsible for a program 
such as the NSTS need to be willing to ask 
themselves: “How confident are we that this design, 
this mission, this launch will succeed?” This is a 
powerful question, if it is properly used. How is 
this question used properly? The first step is to 
provide the format in which the answer is to be 
given. This makes the question into a workable 
tool. 

The proposed format is as follows, taking the 
STS as an example: Let us project ourselves into 
the future to a time when we can imagine that 
many thousands of Shuttle missions have been 
launched. One can now look back at the record 
and ask the following question: “In what fraction 
of these launches was the vehicle lost?” Let this 
fraction be $\phi_{LOV}$. This parameter would then be a
very meaningful figure of merit describing the 
success, safety, and effectiveness of the program. 

At the present time, of course, the numerical 
value of this parameter is not known. One can 
only tell the state of knowledge about what this 
value will be. This is done in the form of a 
probability density curve against $\phi_{LOV}$, using a
logarithmic scale, as shown in Figure D-1.


[Figure D-1: probability density (vertical axis) plotted against $\phi_{LOV}$ (horizontal axis, logarithmic scale).]

FIGURE D-1 State of knowledge probability curve 
for frequency of loss of vehicle. 

This curve expresses the current knowledge about 
$\phi_{LOV}$ based on all the information and evidence
available. The width of the curve reflects the degree
of uncertainty about the value of $\phi_{LOV}$. The whole
shape and location of the curve is a portrayal of 
the current state of confidence in the vehicle. 
Therefore, this “state of knowledge” curve can be 
adopted as the format for quantitative expression 
of confidence. This curve is also the bottom-line 
output of a risk analysis of the vehicle. 
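
As a minimal sketch of what such a state of knowledge curve might look like numerically, the following assumes (purely for illustration; the text does not prescribe a distribution) that the knowledge about $\phi_{LOV}$ is lognormal and is specified by hypothetical 5th and 95th percentile bounds. The function names and the bounds 1e-3 and 1e-1 are assumptions, not values from this report.

```python
import math

Z95 = 1.6448536269514722  # 95th percentile of the standard normal distribution

def lognormal_from_bounds(p05, p95):
    """Return (mu, sigma) of ln(phi) for a lognormal curve whose
    5th and 95th percentiles are p05 and p95."""
    mu = 0.5 * (math.log(p05) + math.log(p95))
    sigma = (math.log(p95) - math.log(p05)) / (2.0 * Z95)
    return mu, sigma

def cdf(phi, mu, sigma):
    """Probability that the true frequency is less than or equal to phi."""
    z = (math.log(phi) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def density_per_decade(phi, mu, sigma):
    """Probability density per unit of log10(phi), the ordinate that
    corresponds to a logarithmic horizontal axis."""
    z = (math.log(phi) - mu) / sigma
    pdf_ln = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf_ln * math.log(10.0)

# Hypothetical state of knowledge: 90% confidence that phi_LOV lies
# between 1e-3 and 1e-1 per launch (illustrative numbers only).
mu, sigma = lognormal_from_bounds(1e-3, 1e-1)
median = math.exp(mu)

if __name__ == "__main__":
    for exponent in range(-4, 1):
        phi = 10.0 ** exponent
        print(f"phi = 1e{exponent:+d}  density/decade = "
              f"{density_per_decade(phi, mu, sigma):.4f}  cdf = {cdf(phi, mu, sigma):.4f}")
```

A narrower curve (smaller sigma) would correspond to a firmer state of knowledge; shifting the curve to the left would correspond to greater confidence in the vehicle.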

With curves of this type, together with an orderly 
compilation of the evidence on which the curve is 
based, NASA can build confidence in a tangible 
form. They can then communicate it convincingly 
to the whole technical and management team, and 
also to Congress, to review committees, and to the 
public at large. 

4. DOCUMENTING CONFIDENCE 
THROUGH A QUANTITATIVE RISK MODEL 

At any point during the life of a project it is 
desirable to be able to reach for a document that 
presents the current risk status of the project in a 
compact, succinct, and quantitative form. This 
document should contain the bottom-line figures 
of merit and the numbers, tables, graphs, and 
diagrams that would capture and characterize the 
risk of the project. It also should make clear the 
main contributors to risk and the main sources of 
unreliability, doubt, and uncertainty at that time. 

The document, which might be called the Risk 
Summary Report, would be updated regularly and 
might be the basic document upon which the risk 
management function would draw. It would con- 
tain in an organized way the combined knowledge 
of the entire technical team on issues of risk. It 
would spell out what is known and not known on 
each point and would quantify all uncertainties so 
that decision makers could clearly understand the 
trade-offs among costs, benefits, and risks. 

Such a document can only be generated as the 
summary output report of an ongoing quantitative 
risk model (QRM) of the project. This model and
this report, properly handled, could become an 
extremely useful mechanism, a primary channel for 
communication between management and the tech- 
nical team. Indeed, it could become an important 
framework and mechanism for communication and 
coordination among all parts of the technical team. 
If used in this way, the report would make a major 
contribution to the success of the project. 

The Risk Summary Report may be thought of 
as the final stage of an information machine. This 
machine is depicted in Figure D-2 as a kind of 
megaphone. At the right end in the figure are 
represented the working levels of the project and 
the design, fabrication, testing, and research or- 
ganizations. The information from all these activ- 
ities, relevant to risk, is continually gathered into 
the machine at the right. This information is 
digested and processed, through the logic of the 
QRM, and emerges finally as the Risk Summary 
Report. 

The primary information flow is thus from right
to left in this figure. However, there is also a very
important reverse flow, a kind of “back EMF.”
The fact that this machine exists, that it is orga- 
nizing and processing the information in certain 
ways, and that people are reading the output in 
certain ways, exerts a valuable orderly discipline 
on the working levels. Questions move from left 
to right, forcing the working levels to continually 
structure and organize their data and their thinking 
about risk. 

If the information machine is properly constructed,
it establishes not only an orderly calculating
and recording mechanism but, perhaps even
more importantly, it establishes a language and a 
conceptual framework that unifies and organizes 
the thinking, communication, and decision making 
of the whole project. Not only are better design 
decisions thus made, but enormous savings in time 
and talent can result simply from the fact that 
everybody is using the same language so that, to a 
great extent, all participants mean the same things 
by the same words. 

The QRM approach can provide an extremely
valuable integrating framework for the Safety,
Reliability, and Quality Assurance (SR&QA) activities.
This framework would include the Failure
Modes and Effects Analyses (FMEA) and hazard
analysis work, which would become in effect part
of the QRM. Indeed, one of the benefits of the
QRM approach is that it would help to ensure that
the results of the FMEA and hazard work are fully
recognized and acted on at the decision level. One
of the ways this benefit is achieved is through the
discipline of quantification, which forces the major
items to the surface, where attention must be paid
to them. A second way is through the quantification
of uncertainty, an even more stringent discipline,
which forces an organization, before
it dismisses an item as an “acceptable” risk, to
show quantitatively that the evidence available
provides sufficient confidence to support that decision.
The quantification of uncertainty also helps
decision makers to know when a change in the
hardware is needed and when the problem is just
lack of confidence, so that perhaps more testing
is needed rather than new designs.


[Figure D-2 labels: RISK REPORT PROPER; BACK EMF]


FIGURE D-2 The Risk Summary Report as the final stage of an information machine. 

5. THE ELEMENTS OF PROBABILISTIC
RISK ANALYSIS

5.1 The “Set of Triplets” Definition of Risk 

In contemplating the design or operation of a 
project, those involved should say to themselves: 
“We know how things are supposed to work out; 
we know our plan. Now we would like to know 
what are the possible departures from that plan.” 
Specifically, they would ask three questions: 

• What can go wrong? 

• What is the likelihood of that happening under 
the current plan? 

• If it does happen, what are the consequences; 
i.e., what is the damage? 

The answers to these questions constitute a risk 
and reliability analysis. The answers might be 
arranged in a table as in Figure D-3. The first 
column contains descriptions and names of scen- 
arios. This is the answer to the first question above. 
The second column contains the likelihoods, $\ell_i$, of
the scenarios, $s_i$. Here we use the word likelihood
in a generic sense. How to quantify likelihood will
be discussed in Section 5.2. The third column
contains the “damage index,” $x_i$, which is a measure
of the consequences of the $i$th scenario.

Each row of the table thus constitutes a triplet

$\langle s_i, \ell_i, x_i \rangle$

giving a scenario, its likelihood, and consequences.
This triplet constitutes then one answer to the three
questions. The table itself, i.e., the set of all triplets
denoted by the outer brackets, provides the total
risk; in particular,

$R = \{\langle s_i, \ell_i, x_i \rangle\}$

is the complete answer to the questions. Therefore
this set of triplets is adopted as the definition of
risk, R.

ANSWERS TO: (1) WHAT CAN GO WRONG?
            (2) WHAT IS THE LIKELIHOOD?
            (3) WHAT IS THE DAMAGE?

SCENARIO    LIKELIHOOD    DAMAGE
$s_1$       $\ell_1$      $x_1$
$s_2$       $\ell_2$      $x_2$
$s_3$       $\ell_3$      $x_3$
...         ...           ...
$s_N$       $\ell_N$      $x_N$

R = RISK = $\{\langle s_i, \ell_i, x_i \rangle\}$

FIGURE D-3 Quantitative definition of risk.

This definition becomes the organizing principle 
for the QRM and, thus, for the SR&QA work on 
the project. What is being sought in this work is 
the identification of all possible significant scenarios 
and the characterization of their likelihood and 
consequences. 
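
The set-of-triplets definition maps naturally onto a simple data structure. The sketch below is one hypothetical way to hold it; the class name `Triplet`, the field names, and the three example scenarios with their numbers are illustrative assumptions, not data from any NASA analysis. Likelihood is held here as a single point-estimate frequency for simplicity, whereas Section 5.2 treats it as an uncertain quantity.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Triplet:
    scenario: str      # s_i: what can go wrong
    likelihood: float  # l_i: here a point-estimate frequency per mission
    damage: float      # x_i: damage index measuring the consequences

# The risk R is the set of all triplets; a list stands in for the set here.
Risk = List[Triplet]

# Hypothetical entries, for illustration only.
risk: Risk = [
    Triplet("hypothetical scenario A", 1e-3, 100.0),
    Triplet("hypothetical scenario B", 2e-3, 40.0),
    Triplet("hypothetical scenario C", 1e-2, 10.0),
]

if __name__ == "__main__":
    # Print the risk table in the layout of Figure D-3.
    print(f"{'SCENARIO':<28}{'LIKELIHOOD':<14}{'DAMAGE'}")
    for t in risk:
        print(f"{t.scenario:<28}{t.likelihood:<14.1e}{t.damage}")
```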

5.2 Quantifying Likelihood 

The idea of likelihood can be expressed quanti- 
tatively in different ways. For NASA-type risk work 
the most useful way might be what is called the 
“probability of frequency” approach. In this ap- 
proach, one can imagine a “model” in which a 
vehicle is launched, or a facility operated under 
specified conditions many, many times. In this 
thought experiment the scenario, $s_i$, will occur with
a certain “frequency,” which is denoted $\phi_i$ and
which is measured in occurrences per mission, per 
launch, per year, or other appropriate unit. 

These frequencies $\phi_i$ may be thought of as
abstract in the sense that, since the experiment
cannot be run completely, the $\phi_i$ cannot be measured
precisely. The $\phi_i$ actually are parameters of
the model and they can be usefully adopted as 
figures of merit indicating the safety and reliability 
of the system. 

We would like then to know the numerical values 
of these parameters, $\phi_i$. As mentioned above, these
values will never be known precisely. However, we 
are not totally at a loss either. There is always a 
certain body of evidence and information relevant 
to these values. So now one can ask, “What 
inferences can be drawn from this evidence about 
the values of these parameters, and with what 
degrees of confidence can those inferences be drawn?” 

The answers to this question can be expressed
in the form of probability curves against the possible
values of the parameters (as in Figure D-1).
These curves are called state of knowledge curves. 
They become the final quantitative expression of 
risk and reliability. 

The remaining question is how these curves are
developed from the evidence available, considering that
the evidence may be of very differing types: test
data, actual flight experience, calculations, judgment
of experts, experience of other similar equipment,
etc. The answer is that the development of
these curves makes heavy use of the fundamental
theorem of inference, Bayes theorem. The use of
this theorem is partly art and partly science, but it
always can be done in a way that is meaningful
for decision making purposes.

In order for the individual state of knowledge
curves on the $\phi_i$’s to be a complete specification of
the knowledge available, certain assumptions must
be made. One is that the scenarios are approximately
mutually exclusive; i.e., only one can happen
at a time. Another is that conditional on the data,
different $\phi_i$’s are statistically independent. If these
assumptions are not satisfied, more complex applications
of Bayes theorem are required. However,
for this discussion, we make these simplifying
assumptions.

5.3 Structuring and Categorizing the Triplets 

Since the number of possible scenarios for a 
system can be very large, it is important in carrying 
out a Probabilistic Risk Assessment (PRA) to or- 
ganize and categorize the set of triplets. This can 
be done in many ways. 

Perhaps the most important categorization of 
triplets is by the magnitude of the consequent 
damage. For this, one wants to know what scen- 
arios lead to destruction or inactivation of the 
space mission. What is the total probability of such 
scenarios? What scenarios lead to substantial de- 
creases in the system’s performance or usefulness? 
What is the probability of that outcome? 

A second way would be to categorize scenarios 
by the part of the system complex in which they 
originate. This would give us a picture of the risk 
of the various elements and subsystems. Another 
important way of looking at the problem is to 
categorize the triplets by the phase of the flight in 
which they take place, thus making visible the risks 
attendant on each flight phase. 
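
The three categorizations described in this section can be sketched as a single grouping routine applied with different keys. Everything named below (the `Scenario` record, its `subsystem` and `phase` fields, the damage thresholds, and the example numbers) is a hypothetical illustration. Summing frequencies within a category relies on the simplifying assumption of Section 5.2 that scenarios are approximately mutually exclusive.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    frequency: float  # point-estimate occurrences per mission
    damage: float     # damage index
    subsystem: str    # where the scenario originates
    phase: str        # flight phase in which it takes place

def total_frequency_by(scenarios, key):
    """Sum scenario frequencies within each category returned by key()."""
    totals = defaultdict(float)
    for s in scenarios:
        totals[key(s)] += s.frequency
    return dict(totals)

def damage_class(s):
    """Coarse damage categories; the thresholds are illustrative."""
    if s.damage >= 100.0:
        return "loss of vehicle"
    if s.damage > 0.0:
        return "degraded mission"
    return "no damage"

# Hypothetical scenarios, for illustration only.
scenarios = [
    Scenario("A", 1e-3, 100.0, "propulsion", "ascent"),
    Scenario("B", 2e-3, 40.0, "avionics", "orbit"),
    Scenario("C", 5e-4, 100.0, "propulsion", "entry"),
    Scenario("D", 1e-2, 10.0, "avionics", "ascent"),
]

by_damage = total_frequency_by(scenarios, damage_class)
by_subsystem = total_frequency_by(scenarios, lambda s: s.subsystem)
by_phase = total_frequency_by(scenarios, lambda s: s.phase)

if __name__ == "__main__":
    for label, table in (("damage", by_damage),
                         ("subsystem", by_subsystem),
                         ("phase", by_phase)):
        print(label, table)
```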

5.4 Pictorial Representation of Risk 

It may be useful for some purposes to express
the damage $x_i$ on an index scale, [0, 100]. The
value $x_i = 0$ represents no damage and the value
$x_i = 100$ represents loss of vehicle (LOV). Intermediate
values of $x_i$ represent partial loss of mission
or vehicle. With this idea a useful pictorial presentation
of risk can be developed in the following
way: In the risk table, Figure D-3, the scenarios
can be numbered in order of increasing damage;
that is, such that

$x_{i+1} \ge x_i$

and let N be the total number of scenarios. Then
we can define

$\Phi(x_i) = \sum_{j=i}^{N} \phi_j$ .

Thus defined, $\Phi(x_i)$ is the total frequency of all
scenarios having damage level $x_i$ or greater.

If these $\Phi(x_i)$ are plotted on a log scale versus $x_i$
and the resulting step-function is smoothed, a curve,
$\Phi(x)$ vs. $x$, is obtained which is known variously
as the “risk curve”, the Rasmussen curve, or the 
“frequency of exceedance” curve as in Figure 
D-4. Its ordinate over any x is the frequency with 
which scenarios occur having damage equal to or 
greater than x. This curve also may be viewed as 
a figure of merit of the system. 
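
The construction of the frequency of exceedance values can be sketched in a few lines. The function name and the input numbers below are hypothetical; the code simply accumulates frequencies from the highest damage level downward, so that each value is the total frequency of scenarios with that damage level or greater. Scenarios that share a damage level are merged first so that each level appears once.

```python
def exceedance_curve(pairs):
    """pairs: iterable of (damage x_i, frequency phi_i).
    Returns [(x, Phi(x))] in order of increasing damage, where Phi(x) is the
    total frequency of all scenarios with damage level x or greater."""
    by_damage = {}
    for x, phi in pairs:
        by_damage[x] = by_damage.get(x, 0.0) + phi

    curve = []
    running = 0.0
    # Walk from the most damaging level down, accumulating frequency.
    for x in sorted(by_damage, reverse=True):
        running += by_damage[x]
        curve.append((x, running))
    curve.reverse()
    return curve

# Hypothetical (damage, frequency) pairs, for illustration only.
pairs = [(10.0, 1e-2), (40.0, 2e-3), (100.0, 1e-3), (100.0, 5e-4)]
curve = exceedance_curve(pairs)

if __name__ == "__main__":
    for x, big_phi in curve:
        print(f"x = {x:5.1f}   Phi(x) = {big_phi:.2e}")
```

Plotting these points with a logarithmic vertical axis and smoothing the steps would give a curve of the kind the text calls the risk curve.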

FIGURE D-4 Risk curve.

As before, since the $\phi_i$ are not known exactly, one
will not know the risk curve exactly. But from the
uncertainty in the individual $\phi_i$, the uncertainty in
$\Phi(x)$ can be calculated. This uncertainty can then
be presented in the form of a family of risk curves

$\{\Phi_P(x) : 0 \le P \le 1\}$ ,

shown, for example, in Figure D-5. This graph is
called a “risk diagram.” For a fixed x, the uncertainty
about $\Phi(x)$ can be quantified by

$\Pr\{\Phi(x) \le \Phi_P(x)\} = P$ .

Suppose, for example, that $\Phi_{0.99}(100) = 10^{-2}$. This
means a confidence level of 99% that the frequency
of LOV [i.e., $\Phi(100)$] is less than or equal to .01.

From a portrayal of such risk diagrams one can 
gain a rapid understanding of the contributions 
that various sources make to the overall risk of a 
system or program. 
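
One way to compute a family of risk curves numerically is by random sampling, sketched below. This is an illustrative stand-in and not a method stated in the text: each scenario frequency is assumed to be independently lognormal with a hypothetical median and spread, many sets of frequencies are drawn, the exceedance frequency is computed for each draw, and percentiles are then read off at each damage level. All names and numbers are assumptions.

```python
import math
import random

def risk_diagram(scenarios, levels=(0.05, 0.50, 0.95), n_samples=2000, seed=0):
    """scenarios: list of (damage x_i, median of phi_i, sigma of ln phi_i).
    Returns {P: [(x, Phi_P(x)), ...]} where Phi_P(x) is the P-th percentile
    of the exceedance frequency at damage level x."""
    rng = random.Random(seed)
    damage_levels = sorted({x for x, _, _ in scenarios})
    samples = {x: [] for x in damage_levels}

    for _ in range(n_samples):
        # One random draw of every scenario frequency.
        draws = [(x, rng.lognormvariate(math.log(median), sigma))
                 for x, median, sigma in scenarios]
        for x in damage_levels:
            samples[x].append(sum(phi for xd, phi in draws if xd >= x))

    diagram = {}
    for p in levels:
        curve = []
        for x in damage_levels:
            ordered = sorted(samples[x])
            # Nearest-rank percentile.
            rank = min(len(ordered) - 1, max(0, math.ceil(p * len(ordered)) - 1))
            curve.append((x, ordered[rank]))
        diagram[p] = curve
    return diagram

# Hypothetical inputs: (damage index, median frequency, log-scale spread).
scenarios = [(10.0, 1e-2, 0.7), (40.0, 2e-3, 1.0), (100.0, 1e-3, 1.2)]
diagram = risk_diagram(scenarios)

if __name__ == "__main__":
    for p, curve in diagram.items():
        print(f"P = {p:.2f}:", [(x, f"{v:.2e}") for x, v in curve])
```

Reading the P = 0.95 curve at x = 100 gives a value below which the loss-of-vehicle frequency lies with 95% confidence under these assumed inputs.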

5.5 Use of Risk Diagrams in Decision Making 

Like everything else in life, large engineered 
systems, such as the STS, necessarily involve a 
degree of risk. In the case of engineered systems, 
however, intelligent design decisions can control 
the amount of risk. Sometimes through a flash of 
insight it is possible to change or simplify a design 
in a way that not only reduces risk but also improves 
performance and reduces the cost. This does happen,
and these are happy occasions. More often,
however, the situation is that risk can be made, in 
principle, as small as one likes, but the price for 
this is diminished performance and increased cost 
of the system. 

[Figure D-5: risk diagram; vertical axis: frequency of exceedance.]

The task of management, therefore, is to strike
an optimal balance between risk, cost, and performance.
The balance is struck and fine-tuned
continuously through day-to-day decisions, as the 
design evolves. In the “flash of insight” cases, the 
decisions are easy to make. In the more usual case, 
trade-offs are required. In these situations, it is 
useful and necessary to have quantitative input so 
that the amount of risk can be weighed against the 
levels of cost and performance. 

The situation in such cases is portrayed in Figure 
D-6, which shows the anatomy of a general decision 
problem. Each option brings with it a certain risk, 
cost, and performance. If these three factors were 
precisely known, it would be easy to make the 
decision. What makes that problem interesting in 
real life is that these factors are never known with 
complete certainty. It is important, then, to quantify 
these uncertainties as part of the input to the 
decision analysis. 

Figure D-6 shows the uncertainties in cost and 
performance quantified in the form of probability 
curves. Each option, therefore, can be characterized 
by triplet <C, B, R> diagrams. The decision maker 
must then choose which triplet (i.e., which option) 
he prefers. In the language of decision theory his 
degree of preference, as a function of the triplet, is 
called a utility function, U. 

The role of quantitative risk analysis, as shown,
is to provide the assessment of risk, including 
uncertainty, as part of the input to decision prob- 
lems. Strictly speaking, PRA per se is limited to 
the risk part of the problem, but the same quan- 
titative way of thinking, the same probabilistic 
methodology, can be and should be applied to the 
cost and performance factors as well. 
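
The decision model can be illustrated with a small expected-utility comparison. This is only a sketch under loudly stated assumptions: the two options, their cost and benefit distributions (normal), their risk distributions (lognormal), and above all the linear utility function with its `loss_value` weight are hypothetical choices made for illustration. The text says only that a utility function U expresses the decision maker's preference over the triplets; it does not specify one.

```python
import math
import random

def expected_utility(option, utility, n_samples=20000, seed=1):
    """Average the utility over random draws of cost C, benefit B, and risk R."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        cost = rng.gauss(option["cost_mean"], option["cost_sd"])
        benefit = rng.gauss(option["benefit_mean"], option["benefit_sd"])
        risk = rng.lognormvariate(math.log(option["risk_median"]),
                                  option["risk_sigma"])
        total += utility(cost, benefit, risk)
    return total / n_samples

def utility(cost, benefit, risk, loss_value=5000.0):
    """Hypothetical linear utility: benefit minus cost minus a penalty
    proportional to the loss frequency."""
    return benefit - cost - loss_value * risk

# Two hypothetical design options: B costs more but carries less risk.
options = {
    "A": {"cost_mean": 100.0, "cost_sd": 10.0,
          "benefit_mean": 150.0, "benefit_sd": 20.0,
          "risk_median": 0.01, "risk_sigma": 0.8},
    "B": {"cost_mean": 130.0, "cost_sd": 10.0,
          "benefit_mean": 150.0, "benefit_sd": 20.0,
          "risk_median": 0.002, "risk_sigma": 0.8},
}

scores = {name: expected_utility(opt, utility) for name, opt in options.items()}
preferred = max(scores, key=scores.get)

if __name__ == "__main__":
    print(scores, "preferred:", preferred)
```

With a different weight on risk the preference could reverse, which is the point of making the trade-off explicit and quantitative.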

5.6 Assembly and Disassembly of Risk 

5.6.1 Identifying Scenarios

According to the definition of risk noted above, 
the first and most important step in risk assessment 
is to identify the scenarios. In this connection, the 
following are some key ideas. First of all, note that 
any scenario that can be described is actually a 
category of scenarios. Thus, “the pipe breaks” is 
a category that includes as sub-categories, “the 
pipe breaks longitudinally,” “there is a double- 
ended guillotine break,” “the pipe breaks in such 
and such location,” etc. 

A second point is that since the objective is to
identify all possible significant scenarios, any method
that helps one do that is good. Any new way of
looking, any new way of categorizing that helps
to be sure that no significant scenarios have been
overlooked is good, so it is perfectly acceptable to
use more than one approach to scenario identification.

FIGURE D-6 Decision model.

One approach that is quite useful is to break the 
overall engineered system into parts and subparts. 
Each part can be examined in detail and the 
questions asked: “What can go wrong with this 
part? What scenarios can originate here?” This 
approach would seem to be particularly appropri- 
ate for space systems. “Parts” could be interpreted 
successively as physical segments of the total sys- 
tem, as functional subsystems in the system; they 
could also mean different phases of the system’s 
mission life. Again, all different ways are helpful. 

Another point of interest is that some scenarios 
are single-event scenarios. Something fails and the
system is damaged or destroyed. Other scenarios 
require several different events to happen coinci- 
dentally, sometimes referred to as multiple failures. 
Other scenarios are “chains” of events. These are 
“cascade” or “domino” scenarios. Something hap- 
pens initially and because of that something else 
fails, which causes a chain of propagating events 
resulting in overall system failure. 


Each of these types of scenarios requires its own
type of analytical tools. Failure modes and effects 
analyses (FMEAs) are useful for single-event scen- 
arios; event trees and event sequence diagrams for 
chains of event-type scenarios; and fault trees for 
coincident failures. In space systems and missions, 
one can expect all these types of scenarios to be 
present and expect all these analytic tools, and 
others, to be useful. The specific mix of methods 
and approaches should be determined by what is 
contributing to the risk. 
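
The three scenario types differ in how their likelihood is assembled from elemental events. The sketch below shows the simplest form of each calculation on a per-mission probability basis. It assumes, for illustration only, that coincident failures are statistically independent and that each step in a chain has a known conditional probability given the preceding step; the function names and numbers are hypothetical.

```python
from math import prod

def single_event(p_failure):
    """Single-event scenario: one failure leads directly to the damage."""
    return p_failure

def coincident(p_failures):
    """Coincident (multiple) failures, as at the AND gate of a fault tree.
    Assumes the failures are independent."""
    return prod(p_failures)

def chain(p_initiator, conditional_steps):
    """Cascade scenario, as along one path of an event tree: an initiating
    event followed by steps, each conditional on the one before."""
    return p_initiator * prod(conditional_steps)

# Hypothetical per-mission probabilities, for illustration only.
p_single = single_event(1e-4)
p_coincident = coincident([1e-2, 1e-3])
p_chain = chain(1e-3, [0.5, 0.1])

if __name__ == "__main__":
    print(f"single event: {p_single:.1e}")
    print(f"coincident:   {p_coincident:.1e}")
    print(f"chain:        {p_chain:.1e}")
```

Where failures share a common cause, the independence assumption in `coincident` would understate the likelihood and a conditional treatment like that in `chain` would be needed instead.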

5.6.2 Quantification of Scenarios

In a methodology that has worked well, long 
run frequency is used as the measure of likelihood 
of the scenario. Thus, an underlying Poisson-type 
random process model is used as the framework 
for discussing the risk and reliability behavior of 
the system. The scenario frequencies are then viewed 
as parameters in the Poisson model, and these 
parameters are used as figures of merit to indicate
the safety and reliability of the system. 

The values of these scenario frequencies are 
determined from the frequencies of all the com- 
ponent events (the “elemental” events) in the scen- 
ario, such as failure of valves, pumps, human errors, 
etc. The results of the modeling logic are thus to 
express the frequencies of the scenarios in terms
of the frequencies, $\lambda_i$, of these elemental events,

$\phi_i = f_i(\lambda_1, \lambda_2, \ldots)$    (1)

Now, the discipline of data analysis and statistical
inference is applied. The question is asked: How
big are the numbers $\lambda_i$? Again, the state of knowledge
probability curves are used to provide the
answer (see Figure D-7).

These curves must reflect all of the evidence and 
information available which are relevant to the $\lambda_i$:
all operating experience, test data, calculations, 
etc. In putting together this information, the logic 
of Bayes theorem is used to help evaluate and 
combine the various types of evidence correctly. 
The discipline of this theorem forces one to organize 
and codify the evidence and helps to curb wishful 
thinking. 

To apply Bayes theorem one needs two basic
ingredients. The first ingredient is a “prior” state
of knowledge curve $P_{0i}(\lambda_i)$, which quantifies the
available qualitative information about $\lambda_i$. Qualitative
information may be in the form of precise
knowledge of related components or expert engineering
judgement. The fact that this qualitative
information can be quantified as a probability
density is the major result of the theory of subjective
probability that has been developed since the 1950’s.

The second ingredient is the “likelihood function”
associated with the available data that contain
information about $\lambda_i$. These data could be
industry data, test data, and/or field data. Let
$D = (D_1, D_2, \ldots)$ be the vector of data available.
The likelihood function, $L(\lambda_i, D)$, is proportional
to the conditional probability of observing the data
D given $\lambda_i$. For example, if the data are observed
defects, then the likelihood function may be derived
from the Poisson distribution.

Bayes theorem integrates these sources of information.
The state of knowledge curve for $\lambda_i$ given
all information is $P_i(\lambda_i)$, which is proportional to

$P_{0i}(\lambda_i) \, L(\lambda_i, D)$ .

The proportionality constant is chosen so that
$P_i(\lambda_i)$ is a probability density (i.e., it integrates
to 1).

FIGURE D-7 State of knowledge probability curve
for elemental parameter $\lambda_i$.

FIGURE D-8 State of knowledge probability curve
for scenario frequency, $\phi_i$.
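
The update described above (prior times likelihood, then normalization) can be carried out numerically on a grid of candidate values. The sketch below is a minimal illustration under assumed inputs: a hypothetical prior that is flat in the logarithm of the failure frequency between 1e-5 and 1, and hypothetical data of zero observed defects in 50 trials with a Poisson likelihood. The function names, the grid, and the data are assumptions made for the example.

```python
import math

def trapz(xs, ys):
    """Trapezoidal-rule integral of ys over xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

def normalize(xs, ys):
    """Scale ys so that it integrates to 1 over xs."""
    area = trapz(xs, ys)
    return [y / area for y in ys]

def poisson_likelihood(defects, trials):
    """Likelihood of observing `defects` events in `trials` opportunities
    when the frequency per opportunity is lam."""
    def likelihood(lam):
        mean = lam * trials
        return math.exp(-mean) * mean ** defects / math.factorial(defects)
    return likelihood

def bayes_update(grid, prior, likelihood):
    """Posterior density on the grid: proportional to prior x likelihood."""
    unnormalized = [p * likelihood(lam) for lam, p in zip(grid, prior)]
    return normalize(grid, unnormalized)

def mean(grid, density):
    return trapz(grid, [lam * d for lam, d in zip(grid, density)])

# Log-spaced grid of candidate frequencies from 1e-5 to 1.
grid = [10.0 ** (-5.0 + 5.0 * i / 400) for i in range(401)]

# Hypothetical prior: flat in log(lambda), i.e. density proportional to 1/lambda.
prior = normalize(grid, [1.0 / lam for lam in grid])

# Hypothetical evidence: no defects observed in 50 trials.
posterior = bayes_update(grid, prior, poisson_likelihood(0, 50))

if __name__ == "__main__":
    print(f"prior mean     = {mean(grid, prior):.4f}")
    print(f"posterior mean = {mean(grid, posterior):.4f}")
```

The defect-free experience pulls the curve toward lower frequencies, which is the behavior one would expect from the text's description.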

Once the curves $P_i(\lambda_i)$ are available, they can be “propagated”
through equation (1) to obtain curves for
the $\phi_i$ (Figure D-8). Finally, since the total loss-of-vehicle
frequency is the sum of the $\phi_i$,

$\phi_{LOV} = \sum_i \phi_i$ ,

the curves $P_i(\phi_i)$ are simply aggregated (through a
mathematical convolution) to obtain a new
curve, $P_1(\phi_{LOV})$, for the LOV frequency. This curve,
in relation to the initial curve, $P_0(\phi_{LOV})$ from Figure
D-1, might appear as in Figure D-9. Curve $P_1$ is a
more satisfactory state of knowledge than $P_0$ and
thus is a better basis for a “go” decision.
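
Propagation and aggregation can be sketched with random sampling, used here as a numerical stand-in for the convolution mentioned in the text. The elemental parameters are assumed, for illustration, to be independent and lognormal; the two scenario functions play the part of equation (1). The component names, medians, spreads, and scenario logic are all hypothetical.

```python
import math
import random

def propagate(lambda_curves, scenario_functions, n_samples=10000, seed=0):
    """Draw elemental parameters, evaluate each scenario frequency through
    its function, and sum them. Returns sorted samples of the total."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_samples):
        lam = {name: rng.lognormvariate(math.log(median), sigma)
               for name, (median, sigma) in lambda_curves.items()}
        totals.append(sum(f(lam) for f in scenario_functions))
    totals.sort()
    return totals

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list."""
    rank = min(len(sorted_values) - 1,
               max(0, math.ceil(p * len(sorted_values)) - 1))
    return sorted_values[rank]

# Hypothetical state of knowledge about elemental parameters:
# name -> (median, sigma of the logarithm).
lambda_curves = {
    "valve": (1e-3, 0.5),
    "pump_a": (1e-2, 0.8),
    "pump_b": (1e-2, 0.8),
}

# Hypothetical scenario logic: one single-event scenario and one
# coincident-failure scenario.
scenario_functions = [
    lambda lam: lam["valve"],
    lambda lam: lam["pump_a"] * lam["pump_b"],
]

lov_samples = propagate(lambda_curves, scenario_functions)
lov_median = percentile(lov_samples, 0.50)

if __name__ == "__main__":
    for p in (0.05, 0.50, 0.95):
        print(f"P = {p:.2f}: phi_LOV = {percentile(lov_samples, p):.2e}")
```

Keeping the per-scenario samples rather than only the total would allow the same run to show which scenarios contribute most to the aggregate.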

This aggregation should be done in stages, so that
the results can be viewed at various levels of aggregation
such as system, subsystem, unit. In this way, one 
could answer macroscopic questions like: “What 
is the total frequency of events that could destroy 
or inactivate the system?" By proceeding down- 
ward in the aggregation, one could then see, at 
successively greater levels of detail, where the bulk 
of this frequency is coming from. This draws 
management’s attention to the aspects of the design 
needing further attention. 



FIGURE D-9 States of knowledge (confidence) be- 
fore and after PRA. 

5.6.3 Design Improvement

The improvement between curves $P_0$ and $P_1$ in
Figure D-9 is simply an improvement in knowledge
and confidence coming from study and analysis
(PRA). It does not reflect any actual changes to the
design of the system. If one now recognizes that,
in the course of such a study and analysis, many
areas of the design or maintenance/operation practices
will surely be discovered where we can do
better, and if those improvements are then implemented,
the probability curve will change again,
hopefully to something like the curve $P_2$ in Figure
D-10.

With repeated cycles of this type of analysis and 
with continued experience and technology im- 
provement, one may hope ultimately to achieve 
something like curve P^, which perhaps is what is 
needed to support a viable manned space program. 



FIGURE D-10 Evolutionary system improvements are reflected in changes 
in the state of knowledge curves. 

APPENDIX E


AN IMPROVED CRITICAL ITEM RISK ASSESSMENT PROCEDURE 

FOR THE 

NATIONAL SPACE TRANSPORTATION SYSTEM 
(With an Example of Application to the 51-L Field Joints) 


1. INTRODUCTION 

On May 28, 1987, a NASA representative made 
a presentation to the Committee on Shuttle Criti- 
cality Review and Hazard Analysis Audit entitled, 
“Critical Items List (CIL) Prioritization.” The method 
discussed was subsequently issued in modified form 
as NSTS Instruction 22491, Reference (3). This 
Instruction for the preparation of Critical Item Risk 
Assessments (CIRA) provides a method for prior- 
itizing the failure modes in the CIL. It contains 
many excellent ideas and is a significant step 
forward. However, the Committee has some con- 
cerns and some related suggestions on how to 
simplify and clarify the method. 

This Appendix also contains in Section 5 an 
example of the application of trend analysis and 
Probabilistic Risk Assessment (PRA) to the pre- 
Challenger O-rings. This application, included here 
only as an example of some applicable analysis 
techniques, makes heavy use of modern statistical
science and Bayesian ideas. 

2. CONCERNS WITH THE CURRENT 
METHOD 

The Committee’s concerns with the CIRA method, 
as currently formulated, can be summarized as 
follows: 

1. In Table 1 of Reference (3) (shown here in
Attachment 1) the column labeled “SEVER- 
ITY” DEFINITIONS really contains worst- 
case damage states. 

2. In Table 1, the columns labeled SUCCESS
PATHS and STATUS CODE FOR REDUNDANCY/BACKUP
are really descriptions of
system or subsystem architectures. They affect
risk by affecting the probabilities in the last
two columns. However, the relevant information
is in the probabilities themselves, not in
the architecture. Any guidelines written on
how to assess the probabilities, either empirically
or subjectively, should contain much
discussion on how success paths, redundancy
structure, and periodic checking strategy affect
the probabilities in columns 4 and 5.

3. The probabilities in the last two columns of
Table 1 are qualitative and open to interpretation
as to what the terms “Very Likely,”
“Likely,” “Unlikely,” and “Very Unlikely”
mean. The two columns, which have the same
qualitative scale, appear to have different
quantitative scales associated with them. In
column 4, “Very Unlikely” appears to mean
something like $<10^{-6}$ and “Very Likely”
means something like $10^{-1}$. In column 5, the
scale depends on whether or not there is
redundancy. If there is no redundancy, then
“Very Unlikely” means something like $10^{-2}$
and “Very Likely” means something like
greater than .95. But if there is redundancy, 
then “Very Unlikely” may mean 10 \ With 
the qualitative definitions of probability, it is 
quite possible that two engineers working on 
two failure modes with the same severities 
and probabilities would assign them to dif- 
ferent probability categories and therefore 
produce inconsistent priorities. It is very im- 
portant that the probabilities have opera- 
tional definitions. Terms like “Unlikely” are 
not operational definitions. 

4. There is no way to produce a unique priority. 
Suppose there are two failure modes, and 
Table 1 is filled out as follows: 


Failure  Severity          Success  Redundancy/  Design          Likelihood of
Mode     Definition        Paths    Backup       Confidence      Worst Case

1        (A) Loss of Life  0        (a) None     (II) Likely     (iv) Unlikely
2        (A) Loss of Life  0        (a) None     (IV) Unlikely   (ii) Likely

Which one should have the higher priority?
Suppose that the last two columns were 
replaced by the following structure: 


Failure  Probability of      Probability of      Probability of
Mode     Failure             Worst Case          Worst Case
                             Given Failure

1        Likely = .01        Unlikely = .01      .0001
2        Unlikely = .00001   Likely = .5         .000005

Now it is clear that failure mode 1 presents 
a higher risk. 
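
The ranking in item 4 follows directly from multiplying the two probabilities for each failure mode. A minimal sketch, using the numbers from the second table above (with .5 as the conditional probability for failure mode 2, the value consistent with the tabulated product .000005); the dictionary layout and function name are illustrative assumptions:

```python
# Probabilities taken from the second table in item 4.
failure_modes = {
    1: {"p_failure": 0.01, "p_worst_given_failure": 0.01},
    2: {"p_failure": 0.00001, "p_worst_given_failure": 0.5},
}

def p_worst_case(mode):
    """Probability of the worst case: Pr{failure} x Pr{worst case | failure}."""
    return mode["p_failure"] * mode["p_worst_given_failure"]

# Rank failure modes from highest to lowest probability of the worst case.
ranking = sorted(failure_modes,
                 key=lambda k: p_worst_case(failure_modes[k]),
                 reverse=True)

if __name__ == "__main__":
    for k in ranking:
        print(f"failure mode {k}: {p_worst_case(failure_modes[k]):.1e}")
```

With numerical probabilities the priority is unique, which is exactly what the qualitative labels could not provide.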

3. PROPOSED IMPROVEMENTS 

As an improvement to Reference (3), the Committee
proposes the procedure described in Table
E-1 below:

All failure modes with the same Worst Damage 
State Given Lack of Redundancy or Redundancy 
Failure would be ranked by column Z. 

The probabilities shown in Table E-1 are for
illustration only and do not reflect any specific 
example. In actual application, it would be highly 
desirable for the analyst to include confidence limits 
(or the equivalent) for each of the probabilities 
listed in the tables produced through the CIRA. 
The Committee recommends strongly that such 
probabilities be documented by a rationale. Many 
of the facts mentioned in the current CIL “Rationale 
for Retention” would be cited in the probability 
rationale — but in the quantitative manner illus- 
trated by the example in Section 5. In addition, 
facts that imply higher probabilities would also be 
analyzed. For example, the long-run frequency of
catastrophic failure for solid rocket motors of a
mature design is 1/50, and therefore approximately 1/25 for two
solid rocket motors. A disaggregation of this
frequency by failure mode would be a useful 
baseline for an analysis. How are our design and 
failure modes different from history? For example, 
the field joint is similar to Titan III, but also 
different. The redundant O-ring points to a smaller 
probability, but the insulation geometry points to 
a higher probability. 
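
The step from 1/50 for one motor to 1/25 for two can be checked with a short calculation. As a sketch, assume (this is an assumption, not something stated in the text) that the two motors fail independently, each with probability $p = 1/50$ per flight:

```latex
\Pr\{\text{at least one of two motors fails}\}
  = 1 - (1 - p)^2
  = 2p - p^2
  = \frac{2}{50} - \frac{1}{2500}
  = 0.0396 \approx \frac{1}{25} .
```

The simple doubling is therefore a close approximation at this level of probability; the correction term $p^2$ changes the result by only about one percent.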

In Table E-1, failure mode 3 has the most risk,
even though it is only a Criticality 1R item. For 
this case, the computation of column W uses the 
following estimates: 

(i) There is one success path remaining after 
the primary failure. 

(ii) The availability of the backup is not readily 
detectable and is checked every third flight; 
and the estimated availability is .99. 

(iii) The probability of a secondary failure is 
.05. 

The formula for column W is

$$
\begin{aligned}
W &= \Pr\{\text{Backup Available}\} \times \Pr\{\text{Secondary Failure}\} + \Pr\{\text{Backup not Available}\} \\
  &= (.99)(.05) + (.01) \\
  &= .0595 .
\end{aligned}
\tag{1}
$$

For failure mode 1, there is no backup; but, it 
is a relatively rare (probability = .001) failure 
mode and infrequently (probability = .01) causes 
the worst damage state. 

Failure mode 2 is much less risky. The computation of column W uses
the following estimates:

(i) There is one success path remaining after the first failure.

(ii) The backup is readily detectable and fixed when failed and the
availability of the backup is .999.

(iii) Given the backup, the probability of secondary failure is .001,
the same as the primary.

Use of equation (1) in this case yields

$$
\begin{aligned}
W &= (.999)(.001) + (.001) \\
  &= .001999 .
\end{aligned}
$$


TABLE E-1 Improved Risk Assessment Procedure

Column   Heading
T        Failure Mode
U        Criticality
V        Probability of Primary Failure During Mission
W        Probability of Redundancy Failure, Given Primary Failure
X        Worst Damage State, Given Lack of Redundancy or Redundancy Failure
Y        Probability of Worst Damage State, Given Lack of Redundancy or
         Redundancy Failure
Z        Probability of Worst Damage State Event, Z = (V)(W)(Y)

T    U     V       W          X                                  Y      Z = (V)(W)(Y)

1    1     .001    1          (A) Loss of Life and/or Vehicle    .01    .00001
2    1R    .001    .001999    (A) Loss of Life and/or Vehicle    .1     .0000001999
3    1R    .01     .0595      (A) Loss of Life and/or Vehicle    1      .000595
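
The entries in columns W and Z of Table E-1 can be reproduced with a short script. This is a minimal sketch: the function names and the dictionary layout are assumptions introduced here, while the numerical inputs are the illustrative estimates quoted in the text for failure modes 1, 2, and 3.

```python
def redundancy_failure_probability(backup_availability, p_secondary_failure):
    """Equation (1): W = Pr{backup available} * Pr{secondary failure}
    + Pr{backup not available}."""
    return backup_availability * p_secondary_failure + (1.0 - backup_availability)


def worst_damage_probability(v, w, y):
    """Column Z of Table E-1: Z = (V)(W)(Y)."""
    return v * w * y


# Column W for each failure mode.
w1 = 1.0                                            # mode 1: no backup
w2 = redundancy_failure_probability(0.999, 0.001)   # mode 2
w3 = redundancy_failure_probability(0.99, 0.05)     # mode 3

# (V, W, Y) for each failure mode, as in Table E-1.
rows = {
    1: (0.001, w1, 0.01),
    2: (0.001, w2, 0.1),
    3: (0.01, w3, 1.0),
}

z = {mode: worst_damage_probability(*cols) for mode, cols in rows.items()}

# Rank failure modes by column Z, highest risk first.
ranking = sorted(z, key=z.get, reverse=True)

for mode in ranking:
    print(f"failure mode {mode}: W = {rows[mode][1]:.6f}, Z = {z[mode]:.10f}")
```

Ranking by Z places failure mode 3 first, which is the point made in the text: a Criticality 1R item can carry more risk than a Criticality 1 item.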

4. RELATIONSHIP BETWEEN IMPROVED
PROCEDURE AND TABLE E-1

There is a strong relationship between the im- 
provements described in Section 3 and NASA’s 
Table 1 (Attachment 1 here). From the “SEVER- 
ITY” DEFINITIONS in column 1 of Table 1, we 
can deduce the following Worst Damage States: 

A. Loss of Life and/or Vehicle 

B. Mission is Aborted 

C. Degraded Operational Capability or Early 
Mission Termination or Damage to a Vehicle 
System 

D. Loss of Some Operational Capability of Ve- 
hicle, but Full Mission Duration. 

E. No Operational Effect 

The probability scales could be set up as categories 
with the definitions given in Table E-2. 

The Committee urges the use of quantitative definitions of
probability. Even though for some failure modes the probabilities will
be assessed subjectively, it is very important that the analyst have
an operational definition. To reiterate, terms like “Unlikely” are not
operational definitions. In addition, use of a quantitative
probability scale will augment the pure engineering judgment approach.

The factors in Reference [3], Section 3.4, are very relevant to
assessing the Probability of Primary Failure During Mission in Table
E-1. Other factors include:

• Product design certification test results

• Manufacturing process qualification test results

• Engineering analytical models

• Related industry data

• Etc.

The number of SUCCESS PATHS and the REDUNDANCY/BACKUP scenarios given
in NASA’s Table 1 (Attachment 1 to this appendix) are very relevant to
assessing the Probability of Redundancy Failure Given Primary Failure
in Table E-1.

The factors relevant to assessing the Probability of Worst Damage
State Event in Table E-1 are very similar to those listed in Reference
[3], Section 3.5. As part of the exercise of assessing this
probability, one could list all the events subsequent to redundancy
failure that do not lead to the worst damage state.

5. APPLICATION TO THE O-RINGS 

Only as an example to illustrate the foregoing proposal, consider the
field joint O-rings prior to the Challenger flight 51-L at a joint
temperature of 31°F, which was predicted for the Challenger flight. It
is based only on a limited knowledge of the subject derived from
References [1] and [2], and thus must be viewed ONLY AS AN
ILLUSTRATION OF A PROCESS.


TABLE E-2 Probability Scales for Improved Risk Assessment Procedure

                 Center Point of Ranges of Probability Values

                 Probability of     Probability of        Probability of Worst
                 Primary Failure    Redundancy Failure    Damage State Given Lack
                 During Mission     Given Primary         of Redundancy or
Description                         Failure               Redundancy Failure

Very Likely      10^[illegible]     10^[illegible]        1.0
Likely           10^[illegible]     10^[illegible]        .5
Possible         10^[illegible]     10^[illegible]        10^[illegible]
Unlikely         10^[illegible]     10^[illegible]        10^[illegible]
Very Unlikely    10^[illegible]     10^[illegible]        10^[illegible]

To keep things simple, only one failure scenario is considered. In the
language of Table E-1 we have:


TABLE E-3 Application of Table E-1 to the SRM Field Joint

Language of Table E-1               Application to Field Joint

Primary failure during mission      Erosion and blowby of the primary
                                    O-ring

Redundancy failure given            Failure of the secondary O-ring
primary failure                     given erosion and blowby of the
                                    primary O-ring

Worst damage state                  Loss of life and vehicle


The reason for considering this scenario is that data are readily
available. Also, in Reference [1], p. 135, it is stated that bypass
erosion or blowby was considered much more serious than just
impingement erosion.

The data set used in this analysis (see Attachment
2) is taken from pages 129-131 of Reference [1].
The subset of these data used here involves only
the actual flights and only the field and nozzle 
joints. A useful organization of this subset is shown 
in Attachment 3. In the columns labeled “erosion,” 
“blowby,” and “erosion or blowby,” the blanks 
mean that the event did not occur. In the column 
labeled “blowby given erosion,” the blank means 
there was no erosion and the zero means that there 
was erosion but no blowby. Most of the data are 
for the primary O-rings; but the data with an 
asterisk are for the secondary O-rings. 

5.1 Primary Failure 

For primary O-ring failures, we consider the scenario of erosion and
blowby. The primary failure probability is:

$$\Pr\{\text{Primary Failure}\} = \Pr\{\text{Primary Erosion}\} \times \Pr\{\text{Primary Blowby} \mid \text{Primary Erosion}\} . \tag{2}$$

The vertical bar in the probability expression (2) reads “conditional
on.” So, for example,

$$\Pr\{\text{Blowby} \mid \text{Erosion}\}$$

would read, “probability of the event Blowby, conditional on the event
Erosion occurring.” For two events A and B, a fundamental law of
probability is

$$\Pr\{A \text{ and } B\} = \Pr\{A\} \times \Pr\{B \mid A\} .$$

5.1.1 Primary Erosion

A plot of the incidents of field joint primary O- 
rings with erosion is shown in Attachment 4. For 
example, flight 51-C, in January 1985, had two 
field joints with primary O-ring erosion; this mis- 
sion experienced a joint temperature of 53° F and 
a leak check pressure of 200 psi. The fitted curves 
are derived from a statistical model which allows 
for possible joint temperature and leak check pres- 
sure effects. 

Flight 51-C experienced both erosion and blowby of the field joint. At
a subsequent Flight Readiness Review where 51-C was discussed, there
was a concluding statement, “Low temperature enhanced probability of
blow-by” (Reference [1], p. 147). On page H-73 of Reference [2], it is
stated that, “Frequency of O-ring damage has increased since the
incorporation of . . . higher stabilization pressures in leak test
procedures . . .”. So it is of interest to statistically model the
effect of temperature and leak check pressure on O-ring anomalies.

Let

$p(t, s)$ = Probability of erosion per field joint primary O-ring,

where

$t$ = Joint temperature
$s$ = Leak check pressure.

The assumptions for this statistical model are:

1. The model for $p(t, s)$ is:

$$\ln\left[\frac{p(t,s)}{1 - p(t,s)}\right] = \alpha + \beta t + \gamma s . \tag{3}$$

This is called a Logistic Regression model. The variables $\alpha$,
$\beta$, $\gamma$ are unknown parameters to be estimated from the
data. Different values of these parameters represent different
relationships between erosion probability and (temperature, pressure).
For example, if $\beta < 0$, then probability decreases with
temperature; but if $\beta > 0$, then probability increases with
temperature. We will let the data determine which of these is most
likely.

2. Given $p(t, s)$, the field joints are statistically independent.


Let

$x(t,s)$ = Number of field joint primary O-rings with erosion for a
launch with joint temperature $t$ and leak check pressure $s$.

Under these assumptions, the probability distribution of $x(t, s)$
given $p(t, s)$ is binomial with parameters $n = 6$ (i.e., 6 field
joints) and $p = p(t, s)$. So for $k = 0, 1, \ldots, 6$,

$$\Pr\{x(t,s) = k \mid p(t,s)\} = \binom{6}{k} [p(t,s)]^{k} [1 - p(t,s)]^{6-k} .$$

Let the subscript $i$ represent the $i$th launch in Attachment 3. So
$i = 1, 2, \ldots, 23$. Let

$x_i$ = Number of field joint primary O-rings with erosion
$t_i$ = Joint temperature
$s_i$ = Leak check pressure
$p_i = p(t_i, s_i)$

Also let

$$\mathbf{x} = (x_1, x_2, \ldots, x_{23}), \quad \mathbf{t} = (t_1, t_2, \ldots, t_{23}), \quad \mathbf{s} = (s_1, s_2, \ldots, s_{23}) .$$

The likelihood function, $L$, given the data $\mathbf{x}$, is defined
as the probability of observing $\mathbf{x}$ conditional on
$\mathbf{t}$, $\mathbf{s}$, and $(\alpha, \beta, \gamma)$. The
variables $\mathbf{t}$ and $\mathbf{s}$ are regarded as known
variables (in standard regression analysis they are called independent
variables); and $(\alpha, \beta, \gamma)$ are the unknown parameters.
The likelihood function is regarded as a function of
$(\alpha, \beta, \gamma)$ and is

$$L(\alpha, \beta, \gamma) = \prod_{i=1}^{23} \binom{6}{x_i} p_i^{x_i} (1 - p_i)^{6 - x_i} .$$

Recall that $p_i$ is a function of $(\alpha, \beta, \gamma)$. The
maximum likelihood estimates of $(\alpha, \beta, \gamma)$ are those
values that maximize the likelihood function. In effect, they are the
values of $(\alpha, \beta, \gamma)$ that make the observed value of
$\mathbf{x}$ the most probable under our model.
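
The likelihood just defined can be evaluated directly. The sketch below is an illustration only: the function names are introduced here, and the four launches in `x_demo`, `t_demo`, and `s_demo` are hypothetical values chosen for the example, not the Attachment 3 data. It evaluates the log of the likelihood at two candidate parameter sets rather than performing the maximization itself.

```python
import math

N_JOINTS = 6  # field joints per launch


def erosion_probability(t, s, alpha, beta, gamma):
    """Logistic model (3): ln[p / (1 - p)] = alpha + beta*t + gamma*s."""
    z = alpha + beta * t + gamma * s
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(params, x, t, s):
    """Log of the binomial likelihood L(alpha, beta, gamma) over all launches."""
    alpha, beta, gamma = params
    total = 0.0
    for x_i, t_i, s_i in zip(x, t, s):
        p_i = erosion_probability(t_i, s_i, alpha, beta, gamma)
        total += (
            math.log(math.comb(N_JOINTS, x_i))
            + x_i * math.log(p_i)
            + (N_JOINTS - x_i) * math.log(1.0 - p_i)
        )
    return total


# HYPOTHETICAL launches for illustration only (not the Attachment 3 data):
# number of eroded field joint primary O-rings, joint temperature (deg F),
# and leak check pressure (psi).
x_demo = [0, 0, 1, 2]
t_demo = [75.0, 70.0, 60.0, 53.0]
s_demo = [200.0, 200.0, 200.0, 200.0]

# Candidate 1: a negative temperature coefficient (more erosion when cold).
ll_temperature_effect = log_likelihood((7.8, -0.17, 0.0024), x_demo, t_demo, s_demo)

# Candidate 2: no temperature or pressure effect at all.
ll_no_effect = log_likelihood((-3.0, 0.0, 0.0), x_demo, t_demo, s_demo)

print(f"log-likelihood with temperature effect: {ll_temperature_effect:.3f}")
print(f"log-likelihood with no effect:          {ll_no_effect:.3f}")
```

A maximum likelihood fit would search over $(\alpha, \beta, \gamma)$ for the largest value of this function; working with the logarithm avoids multiplying many small probabilities and does not change where the maximum occurs.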

There is a close relationship between maximum likelihood estimation
and least squares. The least squares estimates of
$(\alpha, \beta, \gamma)$ are those values that minimize

$$\sum_{i=1}^{23} (x_i - 6 p_i)^2 ,$$

where $6 p_i$ is the expected value of $x_i$ under our model. If the
$x_i$'s had a Gaussian (normal) distribution with common variance,
then the maximum likelihood estimates and the least squares estimates
would be the same. This is because the Gaussian probability density
would then be monotonically related to the sum of squares above.
However, the probability densities of the $x_i$'s in our problem are
binomial and not Gaussian. And it is a well established fact in
statistical science that maximum likelihood estimation is usually more
efficient (closer to the truth) than least squares; so we use maximum
likelihood.

The results of a maximum likelihood analysis of these data under the
above model yield the values in Table E-4.


TABLE E-4 Maximum Likelihood Analysis of the SRM Field Joint Primary
O-Ring Erosion Data

Parameter    Maximum Likelihood    90% Confidence
             Estimate              Interval

$\alpha$     7.8                   [-.1, 15.7]
$\beta$      -.17                  [-.28, -.06]
$\gamma$     .0024                 [-.012, .016]


The 90% Confidence Interval reveals the fact that from our data we
cannot learn the “true” value of $(\alpha, \beta, \gamma)$ with great
precision. For example, a Bayes interpretation of the interval
[-.28, -.06] for the temperature effect, $\beta$, is that given our
data, there is a .9 probability that the “true” value of $\beta$ lies
in the interval [-.28, -.06]. Note that this interval does not include
the value $\beta = 0$ (i.e., no effect). This means that the
temperature effect is “statistically significant;” or that there is
only a very small probability that the true value of $\beta$ is
greater than or equal to zero.
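
How small that probability is can be gauged with a rough calculation. The sketch below rests on an assumption not made in the text: that the state of knowledge about the temperature coefficient is approximately normal, centered on the estimate -.17, with the 90 percent interval [-.28, -.06] spanning plus or minus 1.645 standard deviations. The names used are introduced here for illustration.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


beta_hat = -0.17             # maximum likelihood estimate from Table E-4
lower, upper = -0.28, -0.06  # 90 percent interval from Table E-4
Z_90 = 1.645                 # two-sided 90 percent point of the standard normal

# ASSUMPTION: approximately normal uncertainty, so the interval half-width
# equals 1.645 standard deviations.
sigma = (upper - lower) / (2.0 * Z_90)

# Probability that the true coefficient is greater than or equal to zero.
p_beta_nonnegative = 1.0 - normal_cdf((0.0 - beta_hat) / sigma)

print(f"implied standard deviation: {sigma:.4f}")
print(f"Pr(beta >= 0) under the normal approximation: {p_beta_nonnegative:.4f}")
```

Under this approximation the probability of no effect, or of an effect in the opposite direction, is well below one percent.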

Also note that there is no statistically significant 
pressure effect on field joint erosion. That is because 
most of the variation is explained by temperature 
variation. This is curious, because in Reference [1], 
blow-holes caused by high pressure were cited as 
a cause of erosion. 

Plugging the maximum likelihood estimates into equation (3) yields

$$\ln\left[\frac{p(t,200)}{1 - p(t,200)}\right] = 7.8 - (.17)t + (.0024)(200) = 8.3 - (.17)t .$$

This implies

$$p(t,200) = \frac{e^{8.3 - (.17)t}}{1 + e^{8.3 - (.17)t}} . \tag{4}$$

The curve for 200 psi (plotted in Attachments 4 and 5) is
(6)p(t,200), because there are 6 field joints.

The predicted probability per joint of primary O-ring erosion at 31°F
joint temperature and 200 psi leak check pressure is

$$p(31,200) = .95 \quad (\text{Probability of Primary Erosion}) \tag{5}$$

The 90 percent confidence interval for the “probability of primary
O-ring erosion” is shown in Attachment 5 and is [.5, 1.0]. This shows
that the extrapolation to 31°F introduces considerable uncertainty in
the estimate. The propagation of this uncertainty to the final result
will be discussed in Section 5.5.


5.1.2 Primary Blowby Given Primary Erosion

The frequencies per primary O-ring of blowby given erosion were
extracted from Attachment 3 and are given in Table E-5. An analysis of
the blowby given erosion data shows no statistically significant
effects of joint type, joint temperature, or leak check pressure. So
we use the estimate

$$
\begin{aligned}
&\Pr\{\text{Primary Blowby for Field Joint} \mid \text{Primary Erosion for Field Joint}\} \\
&\quad = \Pr\{\text{Primary Blowby for Field or Nozzle Joint} \mid \text{Primary Erosion for Field or Nozzle Joint}\} \\
&\quad = .292
\end{aligned}
\tag{6}
$$

TABLE E-5 Frequency per Primary O-Ring of Blowby Given Erosion

Joint                Frequency per O-Ring

Field                2/7 = .286
Nozzle               5/17 = .294
Field plus Nozzle    7/24 = .292

Plugging (5) and (6) into (2) yields

$$\Pr\{\text{Primary Failure}\} = (.95)(.292) = .277$$
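
The primary failure estimate can be reproduced numerically. This is a minimal sketch using the rounded fitted curve of equation (4) and the pooled blowby frequency of equation (6); the function and variable names are assumptions introduced here.

```python
import math

N_FIELD_JOINTS = 6


def p_erosion_200psi(t):
    """Equation (4): per-joint erosion probability at 200 psi,
    using the rounded fit 8.3 - (.17)t."""
    z = 8.3 - 0.17 * t
    return math.exp(z) / (1.0 + math.exp(z))


# Per-joint probability of primary O-ring erosion at 31 deg F, equation (5).
p_erosion = p_erosion_200psi(31.0)

# Expected number of eroded field joint primary O-rings per launch,
# i.e. the quantity plotted as the 200 psi curve.
expected_eroded = N_FIELD_JOINTS * p_erosion

# Pooled frequency of primary blowby given primary erosion, equation (6).
p_blowby_given_erosion = 0.292

# Equation (2), using the erosion probability rounded to two places as in the text.
p_primary_failure = round(p_erosion, 2) * p_blowby_given_erosion

print(f"p(31, 200)          = {p_erosion:.3f}")
print(f"expected eroded     = {expected_eroded:.2f} of {N_FIELD_JOINTS}")
print(f"Pr(primary failure) = {p_primary_failure:.3f}")
```

At warmer temperatures the same function falls off quickly, which is why the extrapolation to 31°F dominates the result.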


It is revealing to look at the frequency of primary 
O-ring blowby, given no erosion, in Table E-6. 

TABLE E-6 Frequency per Primary O-Ring of Blowby Given No Erosion

Joint                Frequency per O-Ring

Field                .015
Nozzle               [illegible]
Field plus Nozzle    [illegible]


Comparison with Table E-5 shows that there is a strong statistical
dependence between primary O-ring erosion and blowby, particularly for
the field joint. For the field joint, blowby was rare (frequency
approximately .015) when there was no erosion, but not rare (frequency
= .286) when there was erosion. So

$$\Pr\{\text{Blowby} \mid \text{Erosion}\} \gg \Pr\{\text{Blowby} \mid \text{No Erosion}\} ,$$

which implies strong statistical dependence. If blowby and erosion
were statistically independent, then these two conditional
probabilities would be the same.

The strong statistical dependence shown above suggests that erosion
might be a causal factor for blowby. This idea is borne out by field
data and various experiments. Experiments (Reference [2], p. H-82)
showed that an O-ring will fail to seal with an erosion depth of 0.15
inches. In flights 51-C and 51-B, there was both erosion and blowby of
the field primary O-ring, and a heat effect or erosion of the
secondary O-ring. In both cases, the erosion of the primary O-ring was
among the worst erosions experienced (Reference [2], pp. H-71, H-72)
as measured by cross-sectioned depths of 0.038 and 0.171 inches,
cross-sectioned perimeters of 130° and 360°, and a top view of
affected lengths of 58.75 and 12 inches. This implies that blowby can
be caused by excessive erosion. So our model that the higher the
probability of primary O-ring erosion, the higher the probability of
primary O-ring blowby, is plausible.

5.2 Probability of Secondary Failure

Next we consider the Probability of Redundancy Failure Given Primary
Failure in Table E-1. This would be failure of the secondary O-ring.
Our model of secondary failure is secondary erosion and failure given
primary erosion and blowby. Therefore,

$$
\begin{aligned}
\Pr\{\text{Secondary Failure} \mid \text{Primary Erosion and Blowby}\}
 &= \Pr\{\text{Secondary Erosion} \mid \text{Primary Erosion and Blowby}\} \\
 &\quad \times \Pr\{\text{Secondary Failure} \mid \text{Secondary Erosion}\} .
\end{aligned}
\tag{7}
$$


A statistical analysis of secondary erosion given primary erosion and
blowby shows no statistically significant effects of joint type, joint
temperature, or leak check pressure. So we use the estimate from Table
E-7 below:

$$
\begin{aligned}
&\Pr\{\text{Secondary Erosion for Field Joint} \mid \text{Primary Erosion and Blowby for Field Joint}\} \\
&\quad = \Pr\{\text{Secondary Erosion for Field or Nozzle Joint} \mid \text{Primary Erosion and Blowby for Field or Nozzle Joint}\} \\
&\quad = .286 .
\end{aligned}
\tag{8}
$$


TABLE E-7 Frequency per SRM Joint of Secondary O-Ring Erosion Given
Erosion and Blowby of the Primary O-Ring in 23 Flights Prior to
Challenger 51-L

                     Secondary Erosion
Joint                Given Primary Erosion and Blowby

Field                1/2 = .50
Nozzle               1/5 = .20
Field plus Nozzle    2/7 = .286



The estimation of

$$\Pr\{\text{Secondary Failure} \mid \text{Secondary Erosion}\}$$

in equation (7) presents some difficulties because there were no
secondary failures before 51-L. So we shall express the solutions
parametrically in terms of the parameter

$$\lambda_4 = \Pr\{\text{Secondary Failure} \mid \text{Secondary Erosion}\} \tag{9}$$

The state of knowledge curve (described in Appendix D) for $\lambda_4$
could be determined on the basis of engineering information. Examples
of relevant engineering information which was available before 51-L
are:

1. Joint rotation created doubt about the ability 
of the secondary O-ring to seal. In fact the 
O-ring failure mode was considered Critical- 
ity 1, not Criticality 1R. So, officially, the 
FMEA did not recognize the secondary O- 
rings as providing redundancy. However, ac- 
cording to Reference [1], p. 126, NASA 
management and Thiokol still considered the 
joint to be a redundant seal because there 
were flights where the primary O-ring failed 
and the secondary O-ring sealed in accord- 
ance with its design intent. 

2. In July 1985, a Thiokol engineer, in light of 
the 51-B nozzle joint secondary O-ring ero- 
sion, expressed his concern that if the same 
scenario should occur in a field joint (and he 
believed it could), then it would be a “jump 
ball” as to the success or failure of the joint 
because the secondary O-ring could not re- 
spond to the clevis opening rate and might 
not be capable of pressurization (i.e., in the 
51-L design, which has been changed in the 
redesigned joint). (See Reference [1], p. 139.) 

3. The qualitative assessment (Reference [2], p. 
H-84, Chart 166) of the probability that the 
field joint secondary O-ring will fail given 
erosion penetration of the primary O-ring 
seal is listed in Table E-8. 


TABLE E-8 Qualitative Probability of SRM Secondary O-Ring Failure
Given Erosion Penetration of Primary O-Ring

Time After Ignition        Qualitative Probability of
                           Secondary O-Ring Failure

Ignition Transient:
  0 to 170 ms              low
  170 to 330 ms            medium
  330 to 600 ms            high

Steady State:
  600 ms to 2 min          high


4. There were only two incidents of secondary O-ring erosion in a
field joint. So there was no solid statistical evidence that the
secondary O-ring would work given primary O-ring failure; i.e.,
nothing like 1,000 successes without a failure. Also, as seen in Table
E-8, the probability of secondary O-ring failure depends on time after
ignition.

5. The night before the Challenger launch, a chart provided to NASA by
a Thiokol engineer about the possible temperature effect on the
O-rings (Reference [1], p. 89, Chart 2-2) included concerns that: (i)
lower temperature of the O-rings would result in a change in their
sealing timing function which would result in higher O-ring pressure
actuation time; (ii) if the actuation time increases, threshold of
secondary seal pressurization capability is approached; (iii) if
threshold is reached, then secondary seal may not be capable of being
pressurized.


Plugging (8) and (9) into (7) yields

$$\Pr\{\text{Secondary Failure}\} = (.286)\,\lambda_4 \quad (\text{Probability of Secondary Failure}) \tag{10}$$


5.3 Probability of Worst Damage State Given Redundancy Failure

If the field joint seal were to fail, there is some possibility that
the crew and vehicle would survive. For example, the seal might fail
right before the solid rocket motors completed their burn. However,
the chances are very high that such a failure, should it occur, would
be earlier in the flight. This suggests a value approaching 1 for the
probability of loss of life and vehicle given total seal failure.
Thus, the closest probability value of 1 from Table E-2, column
Probability of Worst Damage State, is selected in this example.

5.4 Probability of Worst Damage State Event 

Using the estimates derived above, the value for column Z in Table E-1
is

$$Z = (.277)(.286)\,\lambda_4 = (.0792)\,\lambda_4 \quad (\text{Probability per Joint of Worst Damage}) \tag{11}$$


5.5 Probability of At Least One Field Joint Failure 

The estimated probability in Section 5.4 is for only one field joint.
The estimated probability of field joint failure for the mission is

$$
\begin{aligned}
\Pr\{\text{Mission Field Joint Failure}\} &= 1 - \Pr\{\text{No Field Joint Failures}\} \\
 &= 1 - [1 - (.0792)\,\lambda_4]^{6} \quad (\text{Probability of Failure})
\end{aligned}
\tag{12}
$$
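
Equations (11) and (12) can be evaluated for any assumed value of the secondary failure parameter. The sketch below is illustrative: the function names and the particular trial values of the parameter are assumptions introduced here, while the constants .277 and .286 are the estimates derived in Sections 5.1 and 5.2, with the Section 5.3 probability taken as 1.

```python
N_FIELD_JOINTS = 6
P_PRIMARY_FAILURE = 0.277     # Section 5.1: primary erosion and blowby
P_SECONDARY_EROSION = 0.286   # equation (8)


def per_joint_worst_damage(lambda_4):
    """Equation (11): column Z for one field joint, with the probability of the
    worst damage state given redundancy failure taken as 1."""
    return P_PRIMARY_FAILURE * P_SECONDARY_EROSION * lambda_4


def mission_field_joint_failure(lambda_4):
    """Equation (12): probability that at least one of the six field joints fails."""
    return 1.0 - (1.0 - per_joint_worst_damage(lambda_4)) ** N_FIELD_JOINTS


# Trial values of lambda_4 chosen only for illustration.
for lambda_4 in (0.1, 0.25, 0.5, 1.0):
    print(
        f"lambda_4 = {lambda_4:4.2f}: "
        f"per joint = {per_joint_worst_damage(lambda_4):.4f}, "
        f"mission = {mission_field_joint_failure(lambda_4):.4f}"
    )
```

Because the per-joint probability is small, the mission probability is close to six times the per-joint value for small values of the parameter and bends below that line as the parameter approaches 1.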

It is clear from the statistical analyses that there is uncertainty in
the estimates of the probabilities used. For example, the 90 percent
confidence intervals in Table E-4 show that the parameter estimates
are uncertain. Also, the .286 estimate in equation (8) was based on
two failures out of seven, and is therefore uncertain. The uncertainty
associated with equation (12) is quantified in Attachment 6. The two
almost linear curves form a 90 percent confidence interval for the
“probability of mission field joint failure,” conditional on the value
of $\lambda_4$. So if the value of $\lambda_4$ is .25, for example,
then the conditional 90 percent confidence interval is [.010, .118].

A subject matter expert could analyze the relevant engineering
information and assess a state of knowledge curve for $\lambda_4$. If
this curve were centered on $\lambda_4 = .25$ with a considerable
variance, then the unconditional 90 percent confidence interval for
the “probability of mission field joint failure,” would be much wider
than the [.010, .118] interval cited above.

The 90 percent confidence intervals in Attachment 6 were derived by a
Bayesian analysis (see Appendix D for more discussion). For the 51-L
environment (e.g., 31°F), we define the following long run “true”
frequency probabilities:

$\theta$ = Probability of mission field joint failure per mission;
and for a given field joint,

$\phi$ = Probability of failure

$\lambda_1$ = Probability of primary O-ring erosion

$\lambda_2$ = Probability of primary O-ring blowby given primary
O-ring erosion

$\lambda_3$ = Probability of secondary O-ring erosion given primary
O-ring erosion and blowby

$\lambda_4$ = Probability of secondary O-ring failure given secondary
O-ring erosion.

Our model is that

$$\theta = 1 - (1 - \phi)^{6} \tag{13}$$

$$\phi = \prod_{i=1}^{4} \lambda_i . \tag{14}$$

Let

$$\Lambda = \lambda_1 \lambda_2 \lambda_3 , \tag{15}$$

then

$$\theta = 1 - [1 - \Lambda \lambda_4]^{6} . \tag{16}$$


In the Bayesian analysis we assume that, conditional on our data,
$\lambda_1$, $\lambda_2$, and $\lambda_3$ are statistically
independent. This is reasonable because the $\lambda_i$'s are
successive conditional frequencies. The state of knowledge curves for
the individual $\lambda_i$'s were derived from Bayesian analyses
assuming “flat” a priori state of knowledge curves. This means that we
did not use much information external to the data in Attachment 3. For
example, we made no attempt to use the engineering models described
in, e.g., Reference [2], p. H-60. This may have been possible by
modeling the uncertainties in the variables of the engineering models.
This idea was suggested by Feynman (Reference [2], Appendix F). The
uncertainties in the engineering models are a possible explanation as
to why the models did not predict very well.

Finally, the state of knowledge curve for $\Lambda$ was derived by
propagating the state of knowledge curves for the $\lambda_i$'s
through equation (15). This was done by a discrete probability
approximation technique. The implied 90 percent confidence interval
for $\Lambda$ is [.007, .082].

The upper and lower curves in Attachment 6 are derived from equation
(16) and are

$$
\begin{aligned}
\theta_u(\lambda_4) &= 1 - [1 - (.082)\,\lambda_4]^{6} \\
\theta_l(\lambda_4) &= 1 - [1 - (.007)\,\lambda_4]^{6} .
\end{aligned}
\tag{17}
$$
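
The bounding curves of equation (17) can be evaluated at any value of the secondary failure parameter. This is a minimal sketch with names introduced here for illustration; it uses the rounded interval endpoints .007 and .082, so its results agree with the interval quoted in Section 5.5 only to within rounding.

```python
N_FIELD_JOINTS = 6

# 90 percent interval for the product of the first three conditional
# probabilities, as quoted in the text.
LAMBDA_LOWER = 0.007
LAMBDA_UPPER = 0.082


def mission_failure(big_lambda, lambda_4):
    """Equation (16): probability of at least one field joint failure per mission."""
    return 1.0 - (1.0 - big_lambda * lambda_4) ** N_FIELD_JOINTS


def conditional_interval(lambda_4):
    """Equation (17): lower and upper curves, conditional on lambda_4."""
    return (
        mission_failure(LAMBDA_LOWER, lambda_4),
        mission_failure(LAMBDA_UPPER, lambda_4),
    )


lower, upper = conditional_interval(0.25)
print(f"lambda_4 = 0.25: [{lower:.3f}, {upper:.3f}]")
```

Evaluating `conditional_interval` over a grid of values between 0 and 1 traces out the two almost linear curves described for Attachment 6.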
TABLE 1


ATTACHMENT 1 NASA’s Proposed CIRA Technique.


ATTACHMENT 2 O-Ring Anomalies Compared with Joint Temperatures and Leak Check Pressure

Flight or             (Solid                           Pressure                         Joint
Motor       Date      Rocket   Joint/                  (in psi)                         Temp.
                      Booster) O-Ring                  Field   Nozzle  Erosion  Blowby  (°F)

DM-1        07/18/77  -        -                       NA      NA      -        -       84
DM-2        01/18/78  -        -                       NA      NA      -        -       49
DM-3        10/19/78  -        -                       NA      NA      -        -       61
DM-4        02/17/79  -        -                       NA      NA      -        -       40
QM-1        07/13/79  -        -                       NA      NA      -        -       83
QM-2        09/27/79  -        -                       NA      NA      -        -       67
QM-3        02/13/80  -        -                       NA      NA      -        -       45
STS-1       04/12/81  -        -                       50      50      -        -       66
STS-2       11/12/81  (Right)  Aft Field/Primary       50      50      X        -       70
STS-3       03/22/82  -        -                       50      50      -        -       69
STS-4       06/27/82  unknown: hardware lost at sea    50      50      NA       NA      80
DM-5        10/21/82  -        -                       NA      NA      -        -       58
STS-5       11/11/82  -        -                       50      50      -        -       68
QM-4        03/21/83  -        Nozzle/Primary          NA      NA      X        -       60
STS-6       04/04/83  (Right)  Nozzle/Primary          50      50      (1)      -       67
                      (Left)   Nozzle/Primary          50      50      (1)      -       67
STS-7       06/18/83  -        -                       50      50      -        -       72
STS-8       08/30/83  -        -                       100     50      -        -       73
STS-9       11/28/83  -        -                       100(2)  100     -        -       70
STS 41-B    02/03/84  (Right)  Nozzle/Primary          200     100     X        -       57
                      (Left)   Forward Field/Primary   200     100     X        -       57
STS 41-C    04/06/84  (Right)  Nozzle/Primary          200     100     X        -       63
                      (Left)   Aft Field/Primary       200     100     (3)      -       63
                      (Right)  Igniter/Primary         NA      NA      -        X       63
STS 41-D    08/30/84  (Right)  Forward Field/Primary   200     100     X        -       70
                      (Left)   Nozzle/Primary          200     100     X        X       70
                      (Right)  Igniter/Primary         NA      NA      -        X       70
STS 41-G    10/05/84  -        -                       200     100     -        -       78
DM-6        10/25/84  -        Inner Gasket/Primary    NA      NA      X        X       52
STS 51-A    11/08/84  -        -                       200     100     -        -       67
STS 51-C    01/24/85  (Right)  Center Field/Primary    200     100     X        X       53
                      (Right)  Center Field/Secondary  200     100     (4)      -       53
                      (Right)  Nozzle/Primary          200     100     -        X       53
                      (Left)   Forward Field/Primary   200     100     X        X       53
                      (Left)   Nozzle/Primary          200     100     -        X       53
STS 51-D    04/12/85  (Right)  Nozzle/Primary          200     200     X        -       67
                      (Right)  Igniter/Primary         NA      NA      -        X       67
                      (Left)   Nozzle/Primary          200     200     X        -       67
                      (Left)   Igniter/Primary         NA      NA      -        X       67
STS 51-B    04/29/85  (Right)  Nozzle/Primary          200     100     X        -       75
                      (Left)   Nozzle/Primary          200     100     X        X       75
                      (Left)   Nozzle/Secondary        200     100     X        -       75
DM-7        05/09/85  -        Nozzle/Primary          NA      NA      X        -       61
STS 51-G    06/17/85  (Right)  Nozzle/Primary          200     200     X (5)    X       70
                      (Left)   Nozzle/Primary          200     200     X        X       70
                      (Left)   Igniter/Primary         NA      NA      -        X       70
STS 51-F    07/29/85  (Right)  Nozzle/Primary          200     200     (6)      -       81
STS 51-I    08/27/85  (Left)   Nozzle/Primary          200     200     X (7)    -       76
STS 51-J    10/03/85  -        -                       200     200     -        -       79
STS 61-A    10/30/85  (Right)  Nozzle/Primary          200     200     X        -       75
                      (Left)   Aft Field/Primary       200     200     -        X       75
                      (Left)   Center Field/Primary    200     200     -        X       75
STS 61-B    11/26/85  (Right)  Nozzle/Primary          200     200     X        -       76
                      (Left)   Nozzle/Primary          200     200     X        X       76
STS 61-C    01/12/86  (Right)  Nozzle/Primary          200     200     X        -       58
                      (Left)   Aft Field/Primary       200     200     X        -       58
                      (Left)   Nozzle/Primary          200     200     -        X       58
STS 51-L    01/28/86                                   200     200                      31

Dash (-) denotes no anomaly; NA denotes not applicable.

(1) On STS-6, both nozzles had a hot gas path detected in the putty with an indication of heat on the
primary O-ring.

(2) On STS-9, one of the right Solid Rocket Booster field joints was pressurized at 200 psi after a
destack.

(3) On STS 41-C, left aft field had a hot gas path detected in the putty with an indication of heat on
the primary O-ring.

(4) On a center field joint of STS 51-C, soot was blown by the primary and there was a heat effect on
the secondary.

(5) On STS 51-G, right nozzle had erosion in two places on the primary O-ring.

(6) On STS 51-F, right nozzle had hot gas path detected in putty with an indication of heat on the
primary O-ring.

(7) On STS 51-I, left nozzle had erosion in two places on the primary O-ring.


ATTACHMENT 3 O-Ring Anomalies Prior to Challenger


ATTACHMENT 4 Occurrence of Field Joint Primary O-rings with Erosion.


ATTACHMENT 5 Maximum Likelihood Estimate and 90% Confidence Interval for the Number of Field Joint Primary
O-rings with Erosion at 200 psi.


ATTACHMENT 6 90 Percent Confidence Interval for the “Probability of Mission Field Joint Failure,” as a Function
of $\lambda_4$ (axis label: Probability of Secondary O-ring Failure Given Secondary O-ring Erosion).



APPENDIX F 

DESCRIPTION OF PROPOSED SYSTEMS SAFETY ENGINEERING FUNCTIONS IN 
SUPPORT OF NATIONAL SPACE TRANSPORTATION SYSTEM RISK ASSESSMENT 

AND RISK MANAGEMENT 


In Section 5.11 the Committee recommends that NASA consider bringing
together appropriate activities into a focused “Systems Safety
Engineering” function at both Headquarters and the centers. This
activity would apply across the entire set of design, development,
qualification and certification, and operations activities of the
National Space Transportation System (NSTS) Program in support of risk
assessment and risk management. Systems safety engineering would
embrace the functions (listed in Section 5.11 and illustrated here in
Figure F-1) which are described briefly in the following paragraphs.*

1. IDENTIFICATION OF FAILURE
MODES AND EFFECTS

The failure modes of each hardware item can be identified at this step
without addressing the probability of each failure mode occurring. All
of the significant effects of each failure mode also would be
identified. These effects (not just the estimated worst-case effect)
are needed also for identification of hazards and for evaluating
potential cascading influences on the failure modes of other parts of
the system. All of the causes of each failure mode (including the
feedback influences from the hazard analysis, step 3 below) should
then be identified. The control of all causes of each failure mode by
design margin, process controls, redundancy, and operating constraints
would be defined. This information would be an input to the analysis
of safety risks in steps 5, 8, and 9.

2. ESTABLISHMENT OF DESIGN 
CRITERIA FOR REDUNDANCY 

Design criteria for redundancy would be based 
on functional and fail-operational requirements for 
components or units which do not have cata- 
strophic single failure modes. These criteria would 
be based on reliability analyses of components 
using either statistical data bases where available 
or estimated failure rate functions. 


* In Figure F-1, the thirteen functions discussed in this appendix are
shown by the boxes which are numbered to correspond. This diagram
can be compared to that currently described for the NSTS Program
by the JSC SR&QA office, as shown in Figure 5-12 in Section 5.11.


3. IDENTIFICATION OF HAZARDS AND 
THEIR POTENTIAL CONSEQUENCES 

Hazards associated with the system can be sys- 
tematically identified using various methods such 
as fault-tree or event-tree networks. Inputs will 
come from mission requirements, the system con- 
figuration, the applicable identified hardware fail- 
ure effects, human factors and the expected envi- 
ronments. Potential consequences of the presence 
of each hazard can then be derived without regard 
for the probability of the events or mishaps occur- 
ring. (However, some screening out of very low 
probability failure events would simplify this ef- 
fort.) Mishaps resulting from combinations of events 
and the impacts of created hazards on failure modes 
in other hardware can be identified. Each of the 
causes of the identified hazards, along with pro- 
posed controls, would be defined for later risk 
assessment in steps 5, 8, and 9. 

4. IDENTIFICATION OF CRITICAL ITEMS 

Using the set of information generated in the 
previous steps, hardware failure modes could be 
categorized on the basis of their potential conse- 
quences. Those designs having failure modes with 
consequences that could result in loss of vehicle or 
life would be returned to engineering for possible 
alternative concepts. Failure modes that remain 
after this cycle could be put into criticality cate- 
gories to be prioritized based on severity of the 
failure effects and the probability of occurrence 
(steps 8 and 9). Those in prioritized categories 
which require Level I approval for either retention 
or a waiver authorization would be submitted 
through Level II PRCB along with a full safety- 
risk assessment produced under the direction of 
NASA systems safety engineers (step 13). 
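
The prioritization described above, severity first and probability of occurrence second, can be sketched as a simple ordering. The severity labels and ranks below are hypothetical placeholders and are not the NSTS criticality category definitions.

```python
# Hypothetical severity ranks; a lower number means a more severe consequence.
SEVERITY_RANK = {
    "loss of life or vehicle": 1,
    "loss of mission": 2,
    "other": 3,
}


def prioritize(failure_modes):
    """Order failure modes by severity, then by descending probability of occurrence."""
    return sorted(
        failure_modes,
        key=lambda fm: (SEVERITY_RANK[fm["severity"]], -fm["probability"]),
    )


# Invented entries for illustration only.
modes = [
    {"name": "A", "severity": "loss of mission", "probability": 1e-2},
    {"name": "B", "severity": "loss of life or vehicle", "probability": 1e-5},
    {"name": "C", "severity": "loss of life or vehicle", "probability": 1e-3},
]
ranked = [fm["name"] for fm in prioritize(modes)]
print(ranked)
```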

FIGURE F-1 Flow diagram of proposed systems safety engineering functions in support of risk assessment. 

5. EVALUATION OF THE PROBABILITY 
OF OCCURRENCE OF CAUSES AND 
CONSEQUENCES OF FAILURE MODES 
AND HAZARDS 

An evaluation can be made of the probability of 
occurrence of each of the causes and consequences 
for each retained failure mode and hazard. These 
analyses could be performed by both the contrac- 
tors’ and NASA’s systems safety engineers. A va- 
riety of tools can be used to perform these evalu- 
ations. The determination of probability of 
occurrence of the causes of failures would be 
expressed as a set of functions related to: 

a. Reliability data for hardware items having 
causes of failure modes that are statistical in 
nature, such as electronic boards. 

b. Wear-out functions for hardware line replace- 
able units where the causes of the failure 
modes are both statistical and have safety 
operating margins that are either time or 
cycle dependent. 

c. Operating margins required where the causes 
of the particular modes of hardware failure 
are dependent on stress, temperature, or other 
environmental factors to which the unit may 
be subjected. 

d. The control which can be exercised over the 
true configuration of the part, unit, sub- 
system, or system. This includes both the 
validation and control of manufacturing and 
integration processes, and the ability to ex- 
plicitly verify the configurations prior to op- 
erations. 
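
The first three kinds of function listed above (items a through c) can be sketched with textbook models. The choice of models here is an assumption made for illustration: an exponential model for statistical failures, a Weibull model for wear-out, and a stress-strength interference model with independent normal distributions for operating margins. Item d, configuration control, is not modeled.

```python
import math


def p_fail_random(failure_rate: float, hours: float) -> float:
    """Item a: probability of failure by time `hours` at a constant failure rate."""
    return 1.0 - math.exp(-failure_rate * hours)


def p_fail_wearout(cycles: float, scale: float, shape: float) -> float:
    """Item b: Weibull probability of failure by `cycles`; shape > 1 models wear-out."""
    return 1.0 - math.exp(-((cycles / scale) ** shape))


def p_fail_margin(mean_strength: float, sd_strength: float,
                  mean_stress: float, sd_stress: float) -> float:
    """Item c: probability that stress exceeds strength (independent normals)."""
    z = (mean_strength - mean_stress) / math.hypot(sd_strength, sd_stress)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Illustrative numbers only: a wider margin gives a lower failure probability.
print(p_fail_margin(13.0, 3.0, 10.0, 4.0))
print(p_fail_margin(20.0, 3.0, 10.0, 4.0))
```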

Evaluation of the probability of occurrence of 
each of the possible consequences of critical hard- 
ware failures or the presence of other severe hazards 
requires assessment of each path of the fault tree. 
The prevention of certain consequence paths would 
be evaluated relative to the system design and the 
specific operational hazard control techniques. 
Probability functions need to be determined for 
both the causes and consequences in order to 
provide inputs, both to the overall risk assessment 
which will guide the final design (or for the current 
STS, the proposed design changes), and to the 
criteria on which the validation and certification 
test programs should be based. 
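
Assessing each path of a fault tree can be sketched as a recursive evaluation of AND and OR gates. The sketch assumes that all basic events are independent and that no basic event appears in more than one branch; real trees with shared events require cut-set methods instead. The tree encoding is hypothetical.

```python
from math import prod


def evaluate(node):
    """Return the probability of a fault-tree node under the independence assumption.

    A node is either a number (basic event probability) or a tuple
    (gate, children) where gate is "AND" or "OR".
    """
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [evaluate(child) for child in children]
    if gate == "AND":
        # All inputs must occur.
        return prod(probs)
    if gate == "OR":
        # At least one input occurs.
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate: {gate}")


# Illustrative tree: (event 1 AND event 2) OR event 3.
tree = ("OR", [("AND", [0.01, 0.02]), 0.001])
top = evaluate(tree)
print(top)
```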

6. ESTABLISHMENT OF SAFETY-RISK 
LEVEL CRITERIA FOR DESIGN 
MARGINS AND HAZARD CONTROLS 

Using relationships of the types derived under 
step 5 as a framework, risk levels can be allocated 
among the various subsystems, units, and compo- 
nents that would be consistent with the acceptable 
safety-risk requirements established by NASA for 
the overall NSTS program. Design criteria can then 
be established for the margins required against each 
cause of a critical failure mode (using the functions 
developed in step 5) and for the controls required 
to limit the consequences of each hazard. This task 
is critical to providing assurance that the NSTS 
system has been configured to a given (acceptable) 
set of safety-risk levels. (Note that one cannot 
assure fully safe operations.) Those risk levels 
(which may be quite different for loss of hardware 
versus loss of life) must have a definable and 
objective set of measures that can be agreed upon 
by Level I and the Administrator of NASA. They 
must later be verified during the test programs. 
Without such quantitative safety-risk level assess- 
ments, assurances of acceptable safety are not 
meaningful and the fulfillment of responsibility is 
not measurable. 
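
One simple way to allocate an overall risk level among subsystems is a weighted apportionment, sketched below. It assumes that the subsystems fail independently and that the failure of any one of them produces the top-level loss; the weights and subsystem names are hypothetical.

```python
def allocate_risk(total_risk, weights):
    """Split an overall failure-probability budget among independent subsystems.

    Each subsystem receives p_i = 1 - (1 - total_risk) ** (w_i / sum(w)),
    so that the combined risk 1 - prod(1 - p_i) equals total_risk.
    """
    total_weight = sum(weights.values())
    survive = 1.0 - total_risk
    return {name: 1.0 - survive ** (w / total_weight) for name, w in weights.items()}


# Illustrative budget: subsystem "b" is allowed three times the share of "a".
budget = allocate_risk(0.01, {"a": 1.0, "b": 3.0})
print(budget)
```

A larger weight grants a subsystem a larger share of the budget; how the weights are chosen is a program decision that this sketch does not address.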

7. DESIGN OF QUALIFICATION AND 
CERTIFICATION TEST PROGRAMS 

Once safety margins have been determined for 
each failure mode of the accepted designs, quan- 
titatively significant validation, qualification, and 
(where required) time or cycle (reuse) dependent 
certification test programs can be designed. These 
test plans must be optimized to extract the maxi- 
mum amount of information on operating margins 
against critical failure modes from the most cost 
effective quantity of hardware and the time period 
which can be allocated to tests. Design of the test 
programs is crucial to the viability of making risk 
assessments. The criteria for the tests should be 
established by reliability and/or systems safety 
engineers who specialize in test program design 
and statistical analysis of test data. 
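
The trade between quantity of test hardware and the information obtained can be illustrated with the standard zero-failure (success-run) relation for pass/fail trials, n >= ln(1 - C) / ln(R). This is a simplified stand-in: it treats each test as a binary trial and does not capture the margin information that the text calls for.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest number of failure-free trials that demonstrates `reliability`
    at the given `confidence`, assuming independent pass/fail trials."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Higher demonstrated reliability requires sharply more failure-free trials.
for r, c in [(0.90, 0.90), (0.95, 0.95), (0.99, 0.90)]:
    print(r, c, zero_failure_sample_size(r, c))
```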

8. OBJECTIVE ASSESSMENT OF SAFETY 
RISKS 

The test data should be statistically analyzed to 
establish credible validated margins against the 
causes of each significant potential failure mode. 
When these measured margins are compared with 
the margin criteria from step 6, and when the 
probability functions for configuration control (step 
5.d) are derived, there will be a meaningful basis 
for making assessments of the probability of oc- 
currence for each failure mode and its associated 
hazard. These probabilities of occurrence must be 
combined with the appropriate analyses of the 
probabilities of the consequences being realized for 
each failure at the subsystem and total system levels 
to provide an objective measure of the portions of 
the overall safety-risks that are associated with 
each retained design and hazard. 
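
The combination described in this step can be sketched as follows: the probability of each failure mode is multiplied by the probability that its consequence is realized, and the results are compared to show each item's portion of the overall risk. Independence of the contributors is assumed, and all names and numbers are invented.

```python
from math import prod


def contribution(p_failure: float, p_consequence_given_failure: float) -> float:
    """Probability that a failure mode occurs and its consequence is realized."""
    return p_failure * p_consequence_given_failure


def overall_risk(contributions) -> float:
    """Probability of at least one mishap, assuming independent contributors."""
    return 1.0 - prod(1.0 - p for p in contributions.values())


# Illustrative retained items.
items = {
    "mode A": contribution(1e-3, 0.5),
    "mode B": contribution(1e-4, 1.0),
}
risk = overall_risk(items)
# Approximate portion of the overall risk attributable to each item.
shares = {name: p / sum(items.values()) for name, p in items.items()}
print(risk, shares)
```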

9. DEVELOPMENT OF ACCEPTANCE 
RATIONALE FOR RETAINED HAZARDS 
AND HAZARD REPORTS 

Rationales for accepting the safety risks associ- 
ated with all created and intrinsic hazards would 
be developed. For those hazards caused by hard- 
ware failure modes, these rationales would embody 
the Critical Items List retention rationales devel- 
oped by the various engineering groups and the 
test-based safety-risk assessments generated in step 
8. This information would be published as a set of 
risk assessed hazard reports. These reports would 
go through the approval and data management 
process shown in Figure F-1. Upon approval by 
Level II PRCB, they would constitute the NSTS 
Accepted Hazards Data Base. 

Those hazards in the data base which result from 
the currently defined Criticality 1 and 1R items 
could then be further classified and prioritized 
based on their assessed safety risks. Those requiring 
final acceptance at Level I would have special 
request packages prepared by NASA systems safety 
engineering. To avoid the misconceptions associ- 
ated with thousands of waivers to an accepted 
system design, these requests should fall into two 
categories: 

1. Items which met their specific design criteria, 
including safety-risk criteria (step 6). These 
items should not require a “waiver,” but only 
Level I approval of the retention requests 
because of their perceived importance or risk 
contribution. 

2. Items which did not meet their specific safety- 
risk design criteria as indicated by test mar- 
gins or detailed risk analyses. These items 
would therefore require a “waiver” for re- 
tention. 

These approval requests to Level I would be pre- 
sented in conjunction with an overall System Safety 
Assessment Report and specific Mission Risk As- 
sessment Reports (step 13 below). 

10. SPECIFICATION OF ENVIRONMENTAL 
AND OPERATING CONSTRAINTS 

Having accepted a residual hazard (whether 
contained or catastrophic) the NASA systems safety 
engineers must specify very explicitly for all equip- 
ment levels (part, unit, subsystem, element, and 
full system) the environmental and operating con- 
straints which will assure that the validated margins 
will not be violated. In this regard, this task also 
would have a major interface with the operations 
activities. The analysis of such things as the effect 
of environmental conditions on the validity of 
validations and certifications is usually not done 
by the quality assurance engineers; therefore, the 
systems safety engineers should be the responsible 
focus for this task. 

11. QUANTITATIVE EVALUATION OF 
FLIGHT DATA TO UPDATE SAFETY 
MARGIN VALIDATIONS 

By reviewing all flight data (or other off-line test 
data and even test data from other programs) for 
explicit information, updated quantitative assess- 
ments of the validated design criteria can be made. 
In order to retain the assured level of risk as new 
data become available, specifications may have to 
be changed for some hardware or new operational 
constraints may have to be defined. 
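
One common way to update a quantitative assessment as flight data accumulate is a Bayesian update of a per-flight failure probability, sketched here with a Beta prior and binomial flight outcomes. The choice of method, the prior, and the flight counts are all assumptions made for illustration and are not drawn from NSTS data.

```python
def update_beta(alpha: float, beta: float, failures: int, successes: int):
    """Return the Beta posterior parameters after observing new flight outcomes."""
    return alpha + failures, beta + successes


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior with mean 0.01, then 20 failure-free flights.
prior = (1.0, 99.0)
posterior = update_beta(*prior, failures=0, successes=20)
print(beta_mean(*prior), beta_mean(*posterior))
```

If the updated estimate no longer satisfies the accepted risk level, that would be the trigger for the specification changes or new operational constraints mentioned above.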

12. OVERSIGHT OF QUALITY ASSURANCE 
FUNCTIONS TO CONTROL SAFETY-RISKS 

In order to fulfill their responsibility to assure 
control to the accepted levels of risk, the systems 
safety engineers must oversee the appropriate qual- 
ity assurance functions. This is essential because 
the validated margins and assessed risks of the 
retained hazards are dependent on total configu- 
ration verification of the overall system and each 
of its constituent parts. By “total” configuration 
one means all aspects of the hardware, software, 
external environments and operating constraints. 

13. OVERALL SYSTEM SAFETY RISK 
ASSESSMENT AND DEFINITION OF THE 
POTENTIAL TO REDUCE THE LEVEL 
OF RISK 

Using all of the above information, the NASA 
systems safety engineers can prepare a series of 
“System Safety Assessment Reports.” These reports 
would continuously update overall system risk 
assessments against the safety-risk objectives estab- 
lished for the various phases of the NSTS Program 
by the risk management activity. The systems safety 
engineers also would define the potential to reduce 
the levels of risk in the program. Mission risk 
assessment reports would also be prepared which 
would incorporate mission accomplishment risk 
assessments, of which the safety risks would be 
one input. 

Where required, retention request packages gen- 
erated in step 9 would be submitted through Level 
II to Level I along with the approved safety-risk 
assessments for each item and an appropriate 
summary of the overall system safety-risks assess- 
ment report. Thus, the retention requests can be 
considered by Level I within the context of a 
definable and objective risk management process. 
The arguments for retention of prioritized critical 
items would be combined with objective assess- 
ments of safety-risks for each item's contribution 
to the overall system’s safety risks. 